Neutralizing Gender Bias in Word Embedding with Latent Disentanglement and Counterfactual Generation

Recent research demonstrates that word embeddings, trained on human-generated corpora, carry strong gender biases in their embedding spaces, and these biases can lead to discriminatory results in various downstream tasks. Whereas previous methods project word embeddings into a linear subspace for debiasing, we introduce a Latent Disentanglement method with a siamese auto-encoder structure and an adapted gradient reversal layer. Our structure separates the semantic latent information and the gender latent information of a given word into disjoint latent dimensions. Afterwards, we introduce a Counterfactual Generation to convert the gender information of words, so that the original and the modified embeddings can produce a gender-neutralized word embedding after geometric alignment regularization, without loss of semantic information. In various quantitative and qualitative debiasing experiments, our method outperforms existing methods for debiasing word embeddings. In addition, our method preserves semantic information during debiasing, minimizing the semantic losses in extrinsic NLP downstream tasks.


Introduction
Recent research has disclosed that word embeddings contain unexpected biases in their geometry on the embedding space (Bolukbasi et al., 2016; Zhao et al., 2019). The bias reflects unwanted stereotypes such as the correlation between gender 1 and occupation words. Bolukbasi et al. (2016) showed that the automatically generated analogies of (she, he) in Word2Vec (Mikolov et al., 2013b) exhibit gender biases at a significant level.

Figure 1: The process view of our method. Gender-biased (purple), gender-counterfactual (red), and neutralized (gray) word embeddings. We can improve the embedding space from (a) to (b) with a better-aligned structure between gender word pairs by the proposed latent disentanglement. Afterwards, (c) we generate the gender-counterfactual embedding of the gender-biased word while keeping a geometrically aligned relationship with the gender word pairs, to guarantee that the pair of word embeddings differs only in gender information, without hurting semantic information. (d) We obtain the gender-neutralized word embedding by interpolating between the pair of original and counterfactual word embeddings.

An example of such analogies is the relatively closer distance of she to nurse and of he to doctor. Garg et al. (2018) demonstrated that embeddings, from Word2Vec (Mikolov et al., 2013a) to GloVe (Pennington et al., 2014), have strong associations between value-neutral words and population-segment words, e.g., a strong association between housekeeper and Hispanic. This unwanted bias can cause biased results in downstream tasks (Caliskan et al., 2017a; Kiritchenko and Mohammad, 2018; Bhaskaran and Bhallamudi, 2019) and gender discrimination in NLP systems.
Among the various gender debiasing methods for pre-trained word embeddings, the most widely recognized is a post-processing method, which projects word embeddings onto the space that is orthogonal to the gender direction vector defined by a set of gender word pairs. However, if the gender direction vector includes a component of semantic information 2, the semantic information will be lost through the post-processing projections.
To balance gender debiasing against semantic information preservation, we propose an encoder-decoder framework that disentangles the latent space of a given word embedding into two encoded latent subspaces: the first part is the gender latent space, and the second part is the semantic latent space, which is independent of the gender information.
To disentangle the latent space into two subspaces, we use a gradient reversal layer that prohibits the inference of the gender latent information from the semantic information. Then, we generate a counterfactual word embedding by converting the encoded gender latent into the opposite gender. Afterwards, the original and the counterfactual word embeddings are geometrically interpolated to neutralize the gender information of the given word embeddings; see Figure 1 for an illustration of our debiasing method.
Our contributions are summarized as follows: • We propose a method for disentangling the latent information of the word embedding by utilizing the siamese auto-encoder structure with an adapted gradient reversal layer.
• We propose a new gender debiasing method, which transforms the original word embedding into gender-neutral embedding, with the gender-counterfactual word embedding.
• We propose a generalized alignment with a kernel function that enforces the embedding shift, during the debiasing process, in a direction that does not damage the semantics of word embedding.
We evaluated the proposed method and other baseline methods with several quantitative and qualitative debiasing experiments, and we found that the proposed method shows significant improvements over the existing methods. Additionally, the results from several NLP downstream tasks show that our proposed method minimizes performance degradation compared to the existing methods.
Related Work

We can divide existing gender debiasing mechanisms for word embeddings into two categories. The first mechanism neutralizes the gender aspect of word embeddings during the training procedure. Zhao et al. (2018) proposed a learning scheme to generate a gender-neutral version of GloVe, called GN-GloVe, which preserves the gender information in pre-specified embedding dimensions while the other embedding dimensions are inferred to be gender-neutral. However, learning new word embeddings for a large-scale corpus can be difficult and expensive.
The second mechanism post-processes trained word embeddings to debias them after training. An example of such post-processing is a linear projection of gender-neutral words onto a subspace orthogonal to the gender direction vector defined by a set of gender-definition words (Bolukbasi et al., 2016). Another way of constructing the gender direction vector is to use common names, e.g., john, mary, etc. (Dev and Phillips, 2019), whereas the previous approach used gender pronouns, such as he and she. In addition to linear projections, Dev and Phillips (2019) utilize other alternatives, such as flipping and subtraction, to reduce the gender bias more effectively. Beyond simple projection methods, Kaneko and Bollegala (2019) proposed a neural-network-based encoder-decoder framework that adds a regularization for preserving the gender-related information in feminine and masculine words.

Methodology
Our model introduces 1) a siamese network structure (Bromley et al., 1994; Weston et al., 2012) with an adapted gradient reversal layer for latent disentanglement and 2) counterfactual data augmentation with geometric regularization for gender debiasing. We process the gender word pairs through the siamese network with auxiliary classifiers to guide the inference of the gender latent dimensions. Afterwards, we debias each gender-neutral word by locating it at the midpoint between the reconstructed original gender latent variable and the counterfactually generated gender latent variable.
As in previous research (Kaneko and Bollegala, 2019), we divide the whole vocabulary V into three mutually exclusive categories: the feminine word set V_f; the masculine word set V_m; and the gender-neutral word set V_n, such that V = V_f ∪ V_m ∪ V_n. In most cases, words in V_f and V_m exist in pairs, so we denote Ω as the set of feminine and masculine word pairs, such that (w_f, w_m) ∈ Ω.

Figure 2: The framework overview of our proposed model. We characterize specialized regularizations and network parameters with colored dotted lines and blue boxes, respectively.

Figure 2 illustrates the overall structure of our proposed method for pre-trained word embeddings, which we name Counterfactual-Debiasing, or CF-Debias. Eq. (1) specifies the entire loss function over the network parameters in Figure 2. The entire loss function is divided into two types of losses: L_ld, the loss for disentanglement, and L_cf, the loss for counterfactual generation; λ can be seen as a balancing hyper-parameter between the two loss terms:

L = λ L_ld + (1 − λ) L_cf    (1)

Overall Model Structure
Here, we use pre-trained word embeddings, where w_i denotes the embedding of the i-th word. In the encoder-decoder framework, we denote the latent variable of w_i as z_i ∈ R^l, which is mapped to the latent space by the encoding function, E : w_i → z_i; and back by the decoding function, D : z_i → ŵ_i. After the disentanglement of the latent space, z_i is divided into two parts, such that z_i = [z_i^s, z_i^g]: z_i^s ∈ R^(l−k) is the semantic latent variable of w_i, and z_i^g ∈ R^k is the gender latent variable of w_i, where k is the pre-defined size of the gender latent dimension. 3
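As a concrete sketch, the split of the encoder output into semantic and gender parts can be written as follows. This is a minimal NumPy illustration; the dimension values match the experimental setting reported later (l = 300, k = 5), and the function name is ours:

```python
import numpy as np

# Illustrative sketch of the latent split z = [z_s, z_g]. The sizes follow
# the paper's experimental setting (l = 300, k = 5); both are configuration
# choices, not fixed by the method itself.
L_DIM, K_DIM = 300, 5

def split_latent(z):
    """Split a latent vector z in R^l into the semantic part z_s in
    R^(l-k) and the gender part z_g in R^k."""
    return z[: L_DIM - K_DIM], z[L_DIM - K_DIM:]

z = np.zeros(L_DIM)
z_s, z_g = split_latent(z)  # z_s: 295 dims, z_g: 5 dims
```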

Siamese Auto-Encoder for Latent Disentanglement
This section provides the construction details of L ld . Eq. (2) defines the objective function for latent disentanglement as a linearly-weighted sum of the losses.
For the disentanglement, our fundamental assumption is that z^s maintains identical semantic information for the gender word pairs, (w_f, w_m) ∈ Ω. Under this assumption, we introduce a latent disentangling method by utilizing the siamese auto-encoder with gender word pairs. The data structure of the gender word pairs provides an opportunity to adapt the siamese auto-encoder structure because the gender word pairs almost always consist of two words in a pair 4.

Semantic Latent Formulation First, we regularize a pair of semantic latent variables, (z_f^s, z_m^s), from a gender word pair, (w_f, w_m), to be the same by minimizing the squared ℓ2 distance, as in Eq. (3), since the semantic information of a gender word pair should be identical regardless of gender.

Gender Latent Formulation To formulate the gender-dependent latent dimensions, we introduce an auxiliary gender classifier, C_r : z^g → [0, 1], given in Eq. (4). C_r is asked to produce one for masculine words, labeled as g_m = 1, and zero for feminine words, g_f = 0. After training, the output of C_r can serve as an indicator of the gender information of each word. 5

Disentanglement of Semantic and Gender Latent The above two regularization terms do not guarantee the independence between the semantic and the gender latent dimensions. To enforce this independence, we introduce a generator with a Gradient Reversal Layer (GRL), C_a : z^s → z^g (Ganin et al., 2016), which generates the gender latent dimensions from the semantic latent dimensions. We adapt the gradient-flipping idea of Ganin et al. (2016) to the latent disentanglement between the semantic and the gender latent dimensions. If z^g can be generated sufficiently well from z^s, then z^s carries enough information about z^g, so this generation should be prohibited to make z^g and z^s independent.
Hence, our feedback of the gradient reversal layer is maximizing the loss of generating z g from z s , which is represented as L di in Eq. (5).
In the learning stage, the gradient of the encoder for z^s, parameterized by θ_s, becomes the sum of 1) ∂L_s/∂θ_s, the gradient of the loss L_s, i.e., the latent disentanglement losses of the encoder for z^s excluding L_di; and 2) −λ_a ∂L_di/∂θ_s, the λ_a-weighted negative gradient of the loss L_di, which is reversed after passing the GRL, because we intend to train the encoder for z^s to prevent the generation of z^g. Eq. (5) specifies the loss function for the disentanglement by the GRL, and Eq. (6) specifies the reversed gradient; see Figure 3.
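A minimal sketch of the gradient reversal behavior, written without any deep learning framework: the forward pass is the identity, while the backward pass flips and scales the gradient flowing from the generator C_a back into the encoder for z^s. The value of λ_a and the function names are illustrative assumptions:

```python
import numpy as np

LAMBDA_A = 1.0  # assumed weight for the reversed gradient

def grl_forward(z_s):
    # Forward pass: identity, so the generator C_a receives z_s unchanged.
    return z_s

def grl_backward(grad_from_generator):
    # Backward pass: negate and scale the incoming gradient, so the
    # encoder for z_s is updated to *maximize* the generation loss L_di.
    return -LAMBDA_A * grad_from_generator

grad = np.array([0.3, -0.7])
reversed_grad = grl_backward(grad)  # array([-0.3, 0.7])
```

In an autograd framework, the same behavior is usually implemented as a custom operation whose backward method multiplies the incoming gradient by −λ_a.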
Reconstruction We add the reconstruction loss given in Eq. (7) for this encoder-decoder framework.

Gender-Counterfactual Generation
This section provides the construction details of L_cf. As with L_ld, we define the objective function for the counterfactual generation as a linearly-weighted sum of the losses introduced in this section, as in Eq. (8).
Unlike the gender word pairs, a word in the gender-neutral word set, w_n ∈ V_n, utilizes a counterfactual generator, C_g : z_n^g → ¬z_n^g, which converts the original gender latent, z_n^g, into the opposite gender, ¬z_n^g. It should be noted that C_g is only activated for optimizing the losses in L_cf, which assumes that the other parameters learned for the latent disentanglement are frozen.
To switch z_n^g, we utilize the prediction of the gender classifier, C_r, which was trained through the disentanglement loss. The modification loss, L_mo, drives C_r to indicate the opposite gender for ¬z_n^g; see Eq. (9). For instance, if C_r returns 0.8 for the original gender latent, z_n^g, then we regularize the virtually generated gender latent, ¬z_n^g, so that C_r returns 0.2.
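The example above can be expressed as a small sketch. The squared-error form of L_mo is an assumption for illustration; the paper only specifies the target behavior via Eq. (9):

```python
def modification_loss(c_r_original, c_r_counterfactual):
    """Hypothetical form of L_mo: drive the classifier output on the
    counterfactual gender latent toward 1 - C_r(z_g_n).
    Both arguments are classifier outputs in [0, 1]."""
    target = 1.0 - c_r_original
    return (c_r_counterfactual - target) ** 2

# The text's example: C_r(z_g_n) = 0.8, so the counterfactual latent is
# regularized toward a classifier output of 0.2.
loss_at_target = modification_loss(0.8, 0.2)   # ~0.0
loss_unchanged = modification_loss(0.8, 0.8)   # ~0.36
```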
While Eq. (9) focuses on the gender latent switch, Eq. (10) emphasizes a minimal change of the gender latent, z_n^g. The combination of these two losses guides the switched gender latent variable to stay close to the original gender latent variable, regularizing the counterfactual generation.

Though we keep the semantic latent variable, z^s, and switch the gender latent variable, z^g, to generate the gender-counterfactual word embedding, their concatenation during decoding can be vulnerable to semantic information changes because of variances in the individual latent variables. Consequently, we constrain the reconstructed word embedding with the counterfactual gender latent, ŵ_cf, to differ only in gender information from ŵ_n, the reconstructed word embedding with the original gender latent.

Linear Alignment For this purpose, we introduce the linear alignment, which regularizes ŵ_n − ŵ_cf by measuring its alignment to the gender direction vector v_g in Eq. (11), an averaged gender difference vector over the gender word pairs.
This regularization constrains the embedding shift of the gender-neutral word to lie in the direction of v_g. This alignment can be accomplished by maximizing the absolute inner product between ŵ_n − ŵ_cf and v_g, as given in Eq. (12). We introduce CF-Debias-LA, which adds the linear alignment regularization, λ_la L_la, to L_cf.
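A sketch of the linear alignment term. The negative-absolute-inner-product loss form and the unit normalization are illustrative assumptions; Eq. (12) only specifies maximizing the absolute inner product:

```python
import numpy as np

def gender_direction(pairs):
    """v_g: average difference vector over gender word pairs (w_f, w_m)."""
    return np.mean([w_m - w_f for w_f, w_m in pairs], axis=0)

def linear_alignment_loss(w_n_hat, w_cf_hat, v_g):
    """Negative absolute inner product between the embedding shift and the
    unit-normalized gender direction; minimizing this loss maximizes the
    alignment of the shift with v_g."""
    shift = w_n_hat - w_cf_hat
    v = v_g / np.linalg.norm(v_g)
    return -abs(float(np.dot(shift, v)))

pairs = [(np.array([0.0, 1.0]), np.array([1.0, 0.0]))]  # toy (she, he)-like pair
v_g = gender_direction(pairs)                            # array([ 1., -1.])
aligned = linear_alignment_loss(np.array([1.0, -1.0]), np.zeros(2), v_g)
orthogonal = linear_alignment_loss(np.array([1.0, 1.0]), np.zeros(2), v_g)
# A shift parallel to v_g yields a lower (more negative) loss than an
# orthogonal shift, as intended.
```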
Kernelized Alignment While the linear alignment computes the gender direction vector v_g as a simple average, the gender information of word embeddings can have a nonlinear structure. Therefore, we introduce the kernelized alignment, which enables a nonlinear alignment between 1) ŵ_m^i − ŵ_f^i of each gender word pair (w_f^i, w_m^i) and 2) ŵ_n − ŵ_cf of a gender-neutral word w_n.
We hypothesize a nonlinear mapping function f, which projects a word embedding into a newly introduced feature space. We can utilize the kernel trick (Schölkopf et al., 1998) to compute pairwise operations in the nonlinear space introduced by f. Let k(w, w') = f(w) · f(w') be a kernel representing an inner product of two vectors in the feature space. Also, we set φ_k to be the k-th eigenvector of the projected outputs of the given embeddings {f(w_i)}_{i=1}^N. Following Appendix A, PC_k is the k-th principal component of a new word embedding w' in the introduced feature space: PC_k = f(w') · φ_k. Then, we find the k-th principal component for the embedding w' as given in Eq. (15), where a_k^i is the i-th component of the k-th eigenvector of K, the N × N kernel matrix of the given data.
Substituting the inner product in Eq. (12) with Eq. (14), we design the nonlinear alignment between the gender difference vector, ŵ_m − ŵ_f, and the gender-neutral shift, ŵ_n − ŵ_cf, by maximizing the top-K kernel principal components, as in Eq. (14). We introduce CF-Debias-KA, which adds the kernelized alignment regularization, λ_ka L_ka, to L_cf. We use the Radial Basis Function (RBF) kernel in our experiments.
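A simplified NumPy sketch of the kernelized alignment score: project the gender-neutral shift onto the top-K kernel principal components of the gender-pair difference vectors, following PC_k = sum_i a_k^i k(w_i, w'). Kernel centering and eigenvector normalization are omitted for brevity, and the γ value and function names are assumptions:

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    # Radial Basis Function kernel: k(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_pcs(diffs, query, top_k=2, gamma=1.0):
    """Top-K kernel principal components of `query` (e.g. the shift
    w_n_hat - w_cf_hat) with respect to the gender-pair difference
    vectors `diffs`."""
    K = np.array([[rbf(a, b, gamma) for b in diffs] for a in diffs])
    vals, vecs = np.linalg.eigh(K)                 # ascending eigenvalues
    a = vecs[:, np.argsort(vals)[::-1][:top_k]]    # top-K eigenvectors of K
    k_query = np.array([rbf(w, query, gamma) for w in diffs])
    return k_query @ a                             # PC_k = sum_i a_k^i k(w_i, w')

diffs = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([1.1, -0.1])]
pcs = kernel_pcs(diffs, np.array([1.0, 0.05]))
```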

Post-Processing by the Word's Category
After learning the network parameters, we post-process each word according to its category: V_f, V_m, or V_n. We gender-neutralize the embedding vector of w_n ∈ V_n by relocating it to the midpoint of the reconstructed original-counterfactual pair of embeddings, such that w := (ŵ_cf + ŵ_n)/2 = ŵ_neu. For words in V_f and V_m, the reconstructed word embedding preserves the gender information, so we can safely keep the gender information of the given word by using the reconstructed embedding, such that w := ŵ.
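For gender-neutral words, the post-processing rule reduces to a midpoint interpolation; a one-line sketch:

```python
import numpy as np

def neutralize(w_n_hat, w_cf_hat):
    """w_neu: midpoint of the reconstructed original and counterfactual
    embeddings, as used for words in V_n."""
    return 0.5 * (np.asarray(w_n_hat) + np.asarray(w_cf_hat))

w_neu = neutralize([1.0, 0.0], [0.0, 1.0])  # array([0.5, 0.5])
```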

Datasets and Experimental Settings
We used the set of gender word pairs created by Zhao et al. (2018) as V_f and V_m, respectively. All models utilize GloVe embeddings trained on the 2017 January dump of English Wikipedia, with 300-dimensional embeddings for 322,636 unique words.

Table 1: Percentage of predictions of each category on the Sembias analogy task, for each language. † and * denote statistically significant differences from Hard-Debias and the Original embedding, respectively. The best model is indicated in boldface. We denote "-" for skipped cases, whose methods are closely tied to the GloVe embedding.

Additionally, to investigate the debiasing effect on languages other than English, we conducted one of the debiasing experiments for Spanish, a Subject-Verb-Object language like English, and for Korean, a Subject-Object-Verb language. We used fastText (Bojanowski et al., 2016) for the experiments on Spanish and Korean. Accordingly, we excluded the baselines whose methods are closely tied to GloVe from the experiments on the other languages. We specify the dimension of z, l, as 300, divided into 295 semantic latent dimensions and 5 gender latent dimensions. Also, we utilize a sequential hyper-parameter schedule, which weights L_ld more at the initial steps and gradually increases the weight for L_cf, by changing λ in Eq. (1) from 1 to 0. Further information on the experimental settings can be found in Appendix G.

Baselines
We compare our proposed model with the baseline models below, and we utilize the authors' implementations. 6 We provide links to the authors' implementations in Appendix H. 7 We use the subtraction method as ATT-Debias.
Besides, GP-Debias and GP-GN-Debias adopt additional losses to neutralize gender bias and to preserve gender information for gender-definition words.

Sembias Analogy Test
We perform the Sembias gender analogy test (Zhao et al., 2018; Jurgens et al., 2012) to evaluate the degree of gender bias in embeddings. The Sembias dataset in English contains 440 instances, and each instance consists of four word pairs: 1) a gender-definition word pair (Def), 2) a gender-stereotype word pair (Stereo), and 3, 4) two none-type word pairs (None). We test models by calculating the linear alignment between each word-pair difference vector, a − b, and he − she, which we refer to as the Gender Direction. This test regards an embedding model as better debiased if the alignment is larger for the word pair of Def than for the word pairs of None and Stereo. Following past practices, we test models with 40 instances from a subset of Sembias whose gender word pairs are not used for training. To investigate the results of the Sembias analogy test in Spanish and Korean, we translated the words in Sembias into the other languages with human corrections.

Table 1 shows the percentages of the largest alignment with the Gender Direction over all instances. For English, CF-Debias-LA selects all the pairs of Def, which shows sufficient maintenance of the gender information for those words. Also, CF-Debias-LA selects neither stereotype nor none-type words, so the difference vectors of Stereo and None always have less alignment to the Gender Direction than the difference vectors of Def. We further describe the experimental settings for Spanish and Korean in Appendix J.

Table 2: WEAT hypothesis test results for five gender-stereotypical word categories. The best and second-best models are indicated in boldface and underline, respectively. The absolute value of the effect size denotes the degree of bias; a value of d closer to 0 means less gender bias.
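The selection rule of the Sembias test can be sketched as follows. The use of the absolute inner product with a unit-normalized direction is an assumption about the precise scoring:

```python
import numpy as np

def sembias_prediction(instance, gender_dir):
    """instance: dict mapping a category ('Def', 'Stereo', 'None_1',
    'None_2') to the difference vector a - b of its word pair. Returns
    the category best aligned with the Gender Direction (he - she)."""
    g = gender_dir / np.linalg.norm(gender_dir)
    scores = {c: abs(float(np.dot(v, g))) for c, v in instance.items()}
    return max(scores, key=scores.get)

# Toy vectors: the Def pair's difference is strongly aligned with the
# gender direction, so a well-debiased embedding should select it.
toy = {
    "Def": np.array([2.0, 0.0]),
    "Stereo": np.array([0.5, 0.5]),
    "None_1": np.array([0.0, 1.0]),
    "None_2": np.array([0.1, -0.9]),
}
pred = sembias_prediction(toy, np.array([1.0, 0.0]))  # 'Def'
```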

WEAT
We apply the Word Embedding Association Test (WEAT) (Caliskan et al., 2017b) as a debiasing test. WEAT uses a permutation test to compute the effect size (d) and p-value in Table 2 as measurements of the bias in word embeddings. The effect size computes the differential association of sets of stereotypical target words, e.g., career vs. family, with the gender word-pair sets from Chaloner and Maldonado (2019a). A higher effect size indicates a higher gender bias between the two sets of target words. The p-value is used to check the significance level of the bias. We provide a detailed description of WEAT in Appendix C. The variations of our method show the best performance for all categories except math vs. art; see Table 2.

Table 3: Human-based evaluation for the gender bias and semantics of the generated analogies, with standard deviations. The best model is indicated in boldface.

Analogy Test with Human based Validation
We conducted a human experiment on the analogies generated by the debiased embeddings to evaluate the debiasing efficacy of each embedding. Each embedding generates a word for the question "a is to b as c is to what?", where the words a and b are selected from the gender word pairs of the Sembias dataset, and c is given as a gender-stereotypical word, e.g., homemaker or housekeeper, from Bolukbasi et al. (2016). The answer word for each question is generated by argmax. 18 human subjects were asked to evaluate the generated analogies from two perspectives: 1) the existence of gender bias in the analogy, and 2) the semantic validity of the analogy. 8 Table 3 shows that our method exhibits the least gender bias while competitively maintaining the semantic validity.

Debiasing Qualitative Analysis
To examine the indirect gender bias in the word embeddings, we perform two qualitative analyses from Gonen and Goldberg (2019). We take the top 500 male-biased words and the top 500 female-biased words, i.e., the words with the top 500 and bottom 500 inner products between the word embeddings and he − she. From the debiasing perspective, these 1,000 word vectors should not be clustered distinctly. Therefore, we create two clusters with K-means and check the heterogeneity of the clusters through cluster-majority classification. The left side of Figure 4 shows that CF-Debias-KA generates a gender-invariant embedding for the gender-biased word sets by showing the lowest cluster classification accuracy.
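The cluster-majority classification accuracy used here can be computed as follows. This is a minimal sketch; the cluster assignments would come from K-means over the 1,000 biased word vectors:

```python
import numpy as np

def cluster_majority_accuracy(cluster_ids, gender_labels):
    """Accuracy of predicting the bias label (0 = female-biased,
    1 = male-biased) by each cluster's majority label. Values near 1.0
    mean the biased word sets still separate cleanly; values near 0.5
    mean the clusters are gender-mixed, i.e. better debiased."""
    cluster_ids = np.asarray(cluster_ids)
    gender_labels = np.asarray(gender_labels)
    correct = 0
    for c in np.unique(cluster_ids):
        members = gender_labels[cluster_ids == c]
        correct += np.bincount(members, minlength=2).max()  # majority vote
    return correct / len(gender_labels)

perfect_split = cluster_majority_accuracy([0, 0, 1, 1], [0, 0, 1, 1])  # 1.0
mixed_split = cluster_majority_accuracy([0, 0, 1, 1], [0, 1, 0, 1])    # 0.5
```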
Gonen and Goldberg (2019) demonstrate that the original bias 9 has a high correlation with

8 We enumerate the embeddings utilized in the experiment and provide a detailed description of the human experiment in Appendix I.
9 The dot product between the original word embedding from GloVe and he − she.

Downstream Task of Debiased Word Embeddings
We compared multiple downstream task performances of the original and the debiased word embeddings to check the ability to preserve semantic information through the debiasing procedures. Following the CoNLL 2003 shared task (Sang and Erik, 2002), we selected Part-Of-Speech tagging, chunking, and Named Entity Recognition as our tasks. Table 4 shows consistent performance degradation for all debiasing methods relative to the original embedding. However, our methods minimized the degradation in performance compared to the baseline models. In particular, CF-Debias-KA shows the smallest performance degradation, by utilizing the nonlinear alignment regularization.

Analyses on Alignment Regularization
If the difference vectors of the gender word pairs are not linearly aligned, the gender direction vector v_g in Eq. (11) cannot be a pure direction of the gender information. Hence, we compared the variances explained by the top 30 principal components (PCs) of the difference vectors of the gender word pairs, as a measurement of the linear alignment. The left plot in Figure 5 shows the proportion of variance from each PC. Our method shows the largest concentration of the variances on a few components, compared to Hard-Debias and the Original embedding. The right plot in Figure 5 shows the Gini index (Gini, 1912) of the variance proportion vector over the PCs. Our method shows the minimal Gini index, which indicates the monopolized proportion of variances. Also, Figure 6 shows two example plots of selected gender word pairs in the original embedding space (upper) and the CF-Debias-LA embedding space (lower), produced by Locally Linear Embedding (LLE) (Roweis and Saul, 2000). The lower plot in Figure 6 shows the consistency of the gender direction, and it visually describes the neutralization of housekeeper and statistician by utilizing the counterfactually augmented word embeddings.
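One standard way to compute a Gini index over a variance-proportion vector is sketched below. Conventions for the Gini index differ in sign and normalization, so this particular normalized form is an assumption, not necessarily the one used in the paper:

```python
import numpy as np

def gini_index(proportions):
    """Normalized Gini index of a proportion vector (entries >= 0,
    summing to 1): measures how unevenly the explained variance is
    spread across the principal components."""
    p = np.sort(np.asarray(proportions, dtype=float))  # ascending order
    n = len(p)
    idx = np.arange(1, n + 1)
    return float(np.sum((2 * idx - n - 1) * p) / ((n - 1) * p.sum()))

uniform = gini_index([0.25, 0.25, 0.25, 0.25])   # 0.0 (perfectly even)
concentrated = gini_index([0.0, 0.0, 0.0, 1.0])  # 1.0 (fully concentrated)
```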

Conclusions
This work makes two contributions to the natural language processing community. As a gender debiasing application, our model produces debiased embeddings that have the most neutral gender latent information as well as efficiently maintained semantics for various NLP downstream tasks. As a methodological contribution, CF-Debias suggests a new method of disentangling the latent information of word embeddings with the gradient reversal layer and of creating counterfactual embeddings by exploiting the geometry of the embedding space. It should be noted that these types of latent modeling can be applied to diverse natural language tasks to control expressions of emotions, prejudices, ideologies, etc.

A The Derivation of Principal Component on Kernelized Alignment
Let us assume that we want to align a word embedding w' to the set of word embeddings {w_i}_{i=1}^N. Then, we introduce a nonlinear mapping function f, which projects a word embedding w_i ∈ R^d into a newly introduced feature space, f(w_i) ∈ R^m. If we assume that the mapped outputs of the word embeddings {f(w_i)}_{i=1}^N are zero-centered, the covariance matrix can be estimated as follows:

C = (1/N) Σ_{i=1}^N f(w_i) f(w_i)^T

As in the main paper, we set φ_k and λ_k to be the k-th eigenvector and eigenvalue of the projected outputs of the given embeddings {f(w_i)}_{i=1}^N, respectively. Then, we obtain the following equation, which describes the eigen-decomposition of the covariance matrix:

C φ_k = λ_k φ_k

From the above equation, φ_k can be represented as a linearly-weighted combination of the N mapped outputs of the word embeddings as follows:

φ_k = Σ_{i=1}^N a_k^i f(w_i)

Then, we multiply f(w_j), for j = 1, ..., N, on both sides of the equation.

We can replace the inner product of two mapped outputs, f(w_i) · f(w_j), with the kernel k(w_i, w_j), which represents an inner product of two vectors in the projected space, for cases when computing the mapped results of the given data is complex or impossible.

The above equation can be represented as the j-th component of the k-th eigen-decomposition problem of K, the N × N matrix of kernel elements k(w_i, w_j) for i, j = 1, ..., N. See the equation below, which is the k-th eigen-decomposition problem of K, with a_k = [a_k^1, ..., a_k^N]^T:

K a_k = N λ_k a_k

This implies that a_k^j is the j-th component of the k-th eigenvector of K, and we can compute a_k^j by solving the eigen-decomposition problem of K.

Substituting f(w_j) in the above equation with f(w'), the mapped result of the target word embedding w', we obtain PC_k, the k-th principal component of the new word embedding w' in the projected space, as follows:

PC_k = f(w') · φ_k = Σ_{i=1}^N a_k^i k(w_i, w')

It should be noted that the above derivation is based on Schölkopf et al. (1998). The proposed kernelized alignment can be seen as an example of applying a nonlinear alignment to word embeddings by utilizing the kernel trick from Schölkopf et al. (1998).

B Notation table
Notation: Description
w_f: The embedding of a feminine word
w_m: The embedding of a masculine word
w_n: The embedding of a gender-neutral word
V_f: The feminine word set
V_m: The masculine word set
V_n: The gender-neutral word set
ŵ_f: The reconstructed word embedding of w_f
ŵ_m: The reconstructed word embedding of w_m
ŵ_n: The reconstructed word embedding of w_n
ŵ_cf: The counterfactually reconstructed word embedding
ŵ_neu: The gender-neutralized word embedding
g_f: The output of the gender classifier for z_f^g
g_m: The output of the gender classifier for z_m^g
v_g: The gender direction vector
Ω: The set of gender word pairs
E: The encoder of our method
D: The decoder of our method
C_r: The auxiliary gender classifier
C_a: The gender latent generator

C WEAT Hypothesis test
The WEAT hypothesis test (Caliskan et al., 2017b) quantifies the bias with an effect size and a p-value. We can compute the effect size of two target word sets against two attribute word sets. To quantify the gender bias, we use the Chaloner and Maldonado (2019b) subsets of masculine (A_1) and feminine (A_2) words as attribute words, and use career-related (T_1) and family-related (T_2) words as the target word sets. We compare the effect size and p-value for different experimental environments by changing the attribute words, as shown in Table 2 of the paper. We can compute the association measure s between a target word t and the attribute word sets as follows:

s(t, A_1, A_2) = mean_{a ∈ A_1} cos(t, a) − mean_{a ∈ A_2} cos(t, a)

We compute the effect size, the degree of bias, based on the difference between the means of the association values as follows:

d = ( mean_{t ∈ T_1} s(t, A_1, A_2) − mean_{t ∈ T_2} s(t, A_1, A_2) ) / std_{t ∈ T_1 ∪ T_2} s(t, A_1, A_2)
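The two formulas above can be sketched directly on toy vectors. Using the sample standard deviation over the union of target associations follows the usual WEAT definition:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(t, A1, A2):
    """s(t, A1, A2): difference of mean cosine similarities of target
    word t to the attribute sets A1 and A2."""
    return np.mean([cosine(t, a) for a in A1]) - np.mean([cosine(t, a) for a in A2])

def effect_size(T1, T2, A1, A2):
    """WEAT effect size d: standardized difference between the mean
    associations of the two target word sets."""
    s1 = [association(t, A1, A2) for t in T1]
    s2 = [association(t, A1, A2) for t in T2]
    return (np.mean(s1) - np.mean(s2)) / np.std(s1 + s2, ddof=1)

# Toy 2-D example: T1 aligns with A1, T2 with A2, so d is positive,
# indicating a strong (biased) association between the sets.
A1, A2 = [np.array([1.0, 0.0])], [np.array([0.0, 1.0])]
T1 = [np.array([1.0, 0.1]), np.array([0.9, 0.0])]
T2 = [np.array([0.1, 1.0]), np.array([0.0, 0.9])]
d = effect_size(T1, T2, A1, A2)
```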

D Performance Test Result for the Gender Classifier C_r

To test the gender-indicating ability of the gender classifier C_r : z^g → [0, 1], we tested the indication accuracy on gender-definition words, e.g., he, she, etc.; and on gender-stereotypical words, e.g., doctor, nurse, etc. We utilized 53 gender word pairs as test word pairs out of the entire set of gender word pairs, utilizing the remaining words for training. We selected well-known gender-biased occupation words as examples of gender-stereotypical words, 10 for each gender case, as follows: [doctor, programmer, boss, maestro, warrior, john, politician, statistician, athlete, nurse, homemaker, cook, cosmetics, dancer, mary, violinist, housekeeper, secretary].
The test accuracies for gender-definition words are 0.8490 and 0.8867 for masculine and feminine words, respectively. For gender-stereotypical words, C_r indicates the correct gender bias for all male-biased words except athlete, and for all female-biased words. Figure 7 shows the visual separation of the gender latent variables for masculine and feminine words.

G Experimental Setup for Our Method
We implement the encoder E and the decoder D with one hidden layer and the hyperbolic tangent function as the activation function. The generators C_a and C_g are implemented as feed-forward neural networks with one hidden layer, followed by the hyperbolic tangent activation function. The gender classifier C_r is similarly implemented as a feed-forward neural network with one hidden layer, followed by a sigmoid activation function at the output layer. The whole training was performed using the Adam optimizer with a learning rate of 10^-5. We trained our model using a single Titan-RTX GPU. Each run takes approximately 2 hours, including the time for saving the post-processed word embeddings.
As described in Appendix D, to test classification accuracy of the gender classifier C r : z g → [0, 1] for gender-definition words and gender stereotypical words, we only used 143 gender word pairs from entire gender word pairs on the training procedure. The remaining 53 gender word pairs were utilized for gender classification test in Appendix D.

I The Experimental Setting of Human Experiment
We conducted a human validation test on the linear analogies generated by the debiased embeddings to evaluate the debiasing efficacy of each embedding. For the question "a is to b as c is to ?", the words a and b were selected from the gender word pairs of the Sembias dataset, and c was sampled from the gender-stereotypical words, e.g., homemaker, given by Bolukbasi et al. (2016). The question word is chosen from . To enable the human subjects to efficiently compare the generated words of each debiased word embedding, we compared only 5 baseline methods (the original GloVe embedding, Hard-Debias, ATT-Debias, CPT-Debias, and GP-GN-Debias) with our methods, CF-Debias-LA and CF-Debias-KA. As stated in Section 4.4 of the main paper, 18 human subjects were asked to evaluate the 84 generated analogies from two perspectives: 1) the existence of gender bias in the generated analogy, and 2) the semantic validity of the analogy. The semantic validity in our experiment corresponds to the question, "Is it possible to infer a semantic relationship from the generated analogy?"
Representative examples of the analogy questions are given as follows: "man is to woman as boss is to ?", "female is to male as weak is to ?".

J The Experimental Settings for Other Languages: Spanish and Korean

For the gender word pairs required for gender debiasing, the query words used in the English version were translated into Spanish and Korean. In this procedure, some words that are not present in the given corpus were excluded.