Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection

The ability to control for the kinds of information encoded in neural representation has a variety of use cases, especially in light of the challenge of interpreting these models. We present Iterative Null-space Projection (INLP), a novel method for removing information from neural representations. Our method is based on repeated training of linear classifiers that predict a certain property we aim to remove, followed by projection of the representations on their null-space. By doing so, the classifiers become oblivious to that target property, making it hard to linearly separate the data according to it. While applicable for multiple uses, we evaluate our method on bias and fairness use-cases, and show that our method is able to mitigate bias in word embeddings, as well as to increase fairness in a setting of multi-class classification.


Introduction
What is encoded in vector representations of textual data, and can we control it? Word embeddings, pre-trained language models, and more generally deep learning methods emerge as very effective techniques for text classification. Accordingly, they are increasingly being used for predictions in real-world situations. A large part of the success is due to the models' ability to perform representation learning, coming up with effective feature representations for the prediction task at hand. However, these learned representations, while effective, are also notoriously opaque: we do not know what is encoded in them. Indeed, there is an emerging line of work on probing deep-learning derived representations for syntactic (Linzen et al., 2016;Hewitt and Manning, 2019;Goldberg, 2019), semantic (Tenney et al., 2019) and factual knowledge (Petroni et al., 2019). There is also evidence that they capture a lot of information regarding the demographics of the author of the text (Blodgett et al., 2016;Elazar and Goldberg, 2018).
What can we do in situations where we do not want our representations to encode certain kinds of information? For example, we may want a word representation that does not take tense into account, or that does not encode part-of-speech distinctions. We may want a classifier that judges the formality of the text, but which is also oblivious to the topic the text was taken from. Finally, and also our empirical focus in this work, this situation often arises when considering fairness and bias of languagebased classification. We may not want our wordembeddings to encode gender stereotypes, and we do not want sensitive decisions on hiring or loan approvals to condition on the race, gender or age of the applicant.
We present a novel method for selectively removing specific kinds of information from a representation. Previous methods are either based on projection on a pre-specified, user-provided direction (Bolukbasi et al., 2016), or on adding an adversarial objective to an end-to-end training process (Xie et al., 2017). Both of these have benefits and limitations, as we discuss in the related work section ( §2). Our proposed method, Iterative Nullspace Projection (INLP), presented in section 4, can be seen as a combination of these approaches, capitalizing on the benefits of both. Like the projection methods, it is also based on the mathematical notion of linear projection, a commonly used de-terministic operator. Like the adversarial methods, it is data-driven in the directions it removes: we do not presuppose specific directions in the latent space that correspond to the protected attribute, but rather learn those directions, and remove them. Empirically, we find it to work well. We evaluate the method on the challenging task of removing gender signals from word embeddings (Bolukbasi et al., 2016;Zhao et al., 2018). Recently, Gonen and Goldberg (2019) showed several limitations of current methods for this task. We show that our method is effective in reducing many, but not all, of these ( §4).
We also consider the context of fair classification, where we want to ensure that a classifier's decision is oblivious to a protected attribute such as race, gender or age. There, we need to integrate the projection-based method within a pre-trained classifier. We propose a method to do so in section §5, and demonstrate its effectiveness in a controlled setup ( §6.2) as well as in a real-world one ( §6.3).
Finally, while we propose a general purpose information-removal method, our main evaluation is in the realm of bias and fairness applications. We stress that this calls for some stricter scrutiny, as the effects of blindly trusting strong claims can have severe real-world consequences on individuals. We discuss the limitations of our model in the context of such applications in section §7.

Related Work
The objective of controlled removal of specific types of information from neural representation is tightly related to the task of disentanglement of the representations (Bengio et al., 2013;Mathieu et al., 2016), that is, controlling and separating the different kinds of information encoded in them. In the context of transfer learning, previous methods have pursued representations which are invariant to some properties of the input, such as genre or topic, in order to ease domain transfer (Ganin and Lempitsky, 2015). Those methods mostly rely on adding an adversarial component (Goodfellow et al., 2014;Ganin and Lempitsky, 2015;Xie et al., 2017;Zhang et al., 2018) to the main task objective: the representation is regularized by an adversary network, that competes against the encoder, trying to extract the protected information from its representation.
While adverserial methods showed impressive performance in various machine learning tasks, and were applied for the goal of removal of sensitive information (Elazar and Goldberg, 2018;Coavoux et al., 2018;Resheff et al., 2019;Barrett et al., 2019), they are notoriously hard to train. Elazar and Goldberg (2018) have evaluated adverserial methods for the removal of demographic information from representations. They showed that the complete removal of the protected information is nontrivial: even when the attribute seems protected, different classifiers of the same architecture can often still succeed in extracting it. Another drawback of these methods is their reliance on a main-task loss in addition to the adverserial loss, making them less suitable for tasks such as debiasing pretrained word embeddings. Xu et al. (2017) utilized a "nullspace cleaning" operator for increasing privacy in classifiers. They remove from the input a subspace that contains (but is not limited to) the nullspace of a pre-trained classifier, in order to clean information that is not used for the main task (and might be protected), while minimally impairing classification accuracy. While similar in spirit to our method, several key differences exist. As the complementary settingremoving the nullsapce of the main-task classifier vs. projection onto the nullspace of protected attribute classifiers -aims to achieve a distinct goal (privacy preserving), there is no notion of exhaustive cleaning. Furthermore, they do not remove protected attributes that are used by the classifier (e.g. when it conditions on gender).
A recent line of work focused on projecting the representation to a subspace which does not encode the protected attributes. Under this method, one identifies directions in the latent space that correspond to the protected attribute, and removes them. In a seminal work, Bolukbasi et al. (2016) aimed to identify a "gender subspace" in word-embedding space by calculating the main directions in a subspace spanned by the differences between gendered word pairs, such as # » he − # » she. They suggested to zero out the components of neutral words in the direction of the "gender subspace" first principle components, and actively pushed neutral words to be equally distant from male and female-gendered words. However, Gonen and Goldberg (2019) have shown that these methods only cover up the bias and that in fact, the information is deeply ingrained in the representations. A key drawback of this approach is that it relies on an intuitive selection of a few (or a single) gender directions, while, as we reveal in our experiments, the gender subspace is actually spanned by dozens to hundreds of orthogonal directions in the latent space, which are not necessarily as interpretable as the # » he − # » she direction. This observation aligns with the analysis of Ethayarajh et al. (2019) who demonstrated that debiasing by projection is theoretically effective provided that one removes all relevant directions in the latent space.

Objective and Definitions
Our main goal is to "guard" sensitive information, so that it will not be encoded in a representation. Given a set of vectors x i ∈ R d , and corresponding discrete attributes Z, z i ∈ {1, ..., k} (e.g. race or gender), we aim to learn a transformation g : R d → R d , such that z i cannot be predicted from g(x i ). In this work we are concerned with "linear guarding": we seek a guard g such that no linear classifier w(·) can predict z i from g(x i ) with an accuracy greater than that of a decision rule that considers only the proportion of labels in Z. We also wish for g(x i ) to stay informative: when the vectors x are used for some end task, we want g(x) to have as minimal influence as possible on the end task performance, provided that z remains guarded. We use the following definitions: Guarded w.r.t. a hypothesis class Let X = x 1 , ..., x m ∈ X ⊆ R d be a set of vectors, with corresponding discrete attributes Z, z i ∈ {1, ..., k}. We say the set X is guarded for Z with respect to hypothesis class H (conversely Z is guarded in X) if there is no classifier W ∈ H that can predict z i from x i at better than guessing the majority class.
Guarding function A function g : R n → R n is said to be guarding X for Z (w.r.t. to class H) if the set {g(x)|x ∈ X} is guarded for Z w.r.t. to H.
We use the term linearly guarded to indicate guarding w.r.t. to the class of all linear classifiers.

Iterative Nullspace Projection
Given a set of vectors x i ∈ R d and a set of corresponding discrete 1 protected attributes z i ∈ Z, we seek a linear guarding function g that remove the linear dependence between Z and X .
We begin with a high-level description of our approach. Let c be a trained linear classifier, parameterized by a matrix W ∈ R k×d , that predicts a property z with some accuracy. We can construct a projection matrix P such that W (P x) = 0 for all x, rendering W useless on dataset X . We then iteratively train additional classifiers W and perform the same procedure, until no more linear information regarding Z remains in X. Constructing P is achieved via nullspace projection, as described below. This method is the core of the INLP algorithm (Algorithm 1).

Nullspace Projection
The linear interaction between W and a new test point x has a simple geometric interpretation: x is projected on the subspace spanned by W 's rows, and is classified according to the dot product between x and W 's rows, which is proportional to the components of x in the direction of W 's rowpsace. Therefore, if we zeroed all components of x in the direction of W 's row-space, we removed all information used by W for prediction: the decision boundary found by the classifier is no longer useful. As the orthogonal component of the rowspace is the nullspace, zeroing those components of x is equivalent to projecting x on W 's nullspace. Figure 2 illustrates the idea for the 2 dimensional binary-classification setting, in which W is just a 2-dimensional vector.
For an algebraic interpretation, recall that the null-space of a matrix W is defined as the space This suggests a simple method for rendering z linearly guarded for a set of vectors X: training a linear classifier that is parameterized by W 0 to predict Z from X, calculating its nullspace, finding the orthogonal projection matrix P N (W 0 ) onto the nullspace, and using it to remove from X those components that were used by the classifier for predicting Z.
Note that the orthogonal projection P N (w 0 ) is the least harming linear operation to remove the linear information captured by W 0 from X, in the sense that among all maximum rank (which is not full, as such transformations are invertible-hence not linearly guarding) projections onto the nullspace of W 0 , it carries the least impact on distances. This is so since the image under an orthogonal projection into a subspace is by definition the closest vector in that subspace.
Iterative Projection Projecting the inputs X on the nullspace of a single linear classifier does not suffice for making Z linearly guarded: classifiers can often still be trained to recover z from the projected x with above chance accuracy, as there are often multiple linear directions (hyperplanes) that can partially capture a relation in multidimensional space. This can be remedied with an iterative process: After obtaining P N (W 0 ) , we train classifier W 1 on P N (W 0 ) X, obtain a projection matrix P N (W 1 ) , train a classifier W 2 on P N (W 1 ) P N (W 0 ) X and so on, until no classifier W m+1 can be trained. We return the guarding projection matrix P = P N (Wm) P N (W m−1 ) ...P N (W 0 ) , with the guarding function g(x) = P x. Crucially, the ith classifier W i is trained on the data X after the projection on the nullspaces of classifiers W 0 , ..., W i−1 and is therefore trained to find separating planes that are independent of the separating planes found by previous classifiers.
In Appendix §A.1 we prove three desired proprieties of INLP: (1) any two protected-attribute classifiers found in INLP are orthogonal (Lemma A.1); (2) while in general the product of projection matrices is not a projection, the product P calculated in INLP is a valid projection (Corollary A.1.2); and (3) it projects any vector to the intersection of the nullspaces of each of the classifiers found in INLP, that is, after n INLP iterations, P is a projection to N (W 0 ) ∩ N (W 1 ) · · · ∩ N (W n ) (Corollary A.1.3). We further bound the damage P causes to the structure of the space (Lemma A.2). INLP can thus be seen as a linear dimensionalityreduction method, which keeps only those directions in the latent space which are not indicative of the protected attribute.
During iterative nullspace projection, the property z becomes increasingly linearly-guarded in P x. For binary protected attributes, each intermediate W j is a vector, and the nullspace rank is d−1. Therefore, after n iterations, if the original rank of X was r, the rank of the projected input g(X) is at least r − n. The entire process is formalized in Algorithm 1.

Algorithm 1 Iterative Nullspace Projection (INLP)
Input : (X, Z): a training set of vectors and protected attributes n: Number of rounds Result: A projection matrix P Function GetProjectionMatrix(X, Z): INLP bears similarities to Partial Least Squares (PLS; Geladi and Kowalski (1986); Barker and Rayens (2003)), an established regression method. Both iteratively find directions that correspond to Z: while PLS does so by maximizing covariance with Z (and is thus less suited to classification), INLP learns the directions by training classifiers with an arbitrary loss function. Another difference is that INLP focuses on learning a projection that neutralizes Z, while PLS aims to learn low-dimensional representation of X that keeps information on Z.
Implementation Details A naive implementation of Algorithm 1 is prone to numerical errors, due to the accumulative projection-matrices multiplication P ← P N (W i ) P . To mitigate that, we use the formula of Ben-Israel (2015), which connects the intersection of nullspaces with the projection matrices to the corresponding rowspaces: (1) Where P R (W i ) is the orthogonal projection matrix to the row-space of a classifier W i . Accordingly, in practice, we do not multiply P ← P N (W i ) P but rather collect rowspace projection matrices P R(W i ) for each classifier W i . In place of each input projection X projected ← P N (W i ) X projected , we recalculate P := P N (w 1 )∩...∩N (w i ) according to 1, and perform a projection X projected ← P X. Upon termination, we once again apply 1 to return the final nullspace projection matrix P N (W 1 )∩... ∩N (Wn) . The code is publicly available. 2

Application to Fair Classification
The previous section described the INLP method for producing a linearly guarding function g for a set of vectors. We now turn to describe its usage in the context of providing fair classification by a (possibly deep) neural network classifier.
In this setup, we are given, in addition to X and Z also labels Y , and wish to construct a classifier f : X → Y , while being fair with respect to Z. Fairness in classification can be defined in many ways (Hardt et al., 2016;Madras et al., 2019;Zhang et al., 2018). We focus on a notion of fairness by which the predictor f is oblivious to Z when making predictions about Y .
To use linear guardedness in the context of a deep network, recall that a classification network f (x) can be decomposed into an encoder enc followed by a linear layer W : f (x) = W · enc(x), where W is the last layer of the network and enc is the rest of the network. If we can make sure that Z is linearly guarded in the inputs to W , then W will have no knowledge of Z when making its prediction about Y , making the decision process oblivious to Z. Adversarial training methods attempt to achieve such obliviousness by adding an adversarial objective to make enc(x) itself guarding. We take a different approach and add a guarding function on top of an already trained enc. We propose the following procedure. Given a training set X,Y and protected attribute Z, we first train a neural network f = W · enc(X) to best predict Y . This results in an encoder that extracts effective features from X for predicting Y . We then consider the vectors enc(X), and use the INLP method to produce a linear guarding function g that guards Z in enc(X). At this point, we can use the classifier W · g(enc(x)) to produce oblivious decisions, however by introducing g (which is lower rank than enc(x)) we may have harmed W s performance. We therefore freeze the network and fine-tune only W to predict Y from g(enc(x)), producing the final fair classifier f (x) = W · g(enc(x)). Notice that W only sees vectors which are linearly guarded for Z during its training, and therefore cannot take Z into ac-count when making its predictions, ensuring fair classification.
We note that our notion of fairness by obliviousness does not, in the general case, correspond to other fairness metrics, such as equality of odds or of opportunity. It does, however, correlate with fairness metrics, as we demonstrate empirically. Further refinement. Guardedness is a property that holds in expectation over an entire dataset. For example, when considering a dataset of individuals from certain professions (as we do in §6.3), it is possible that the entire dataset is guarded for gender, yet if we consider only a subset of individuals (say, only nurses), we may still be able to recover gender with above majority accuracy, in that sub-population. As fairness metrics are often concerned with classification behavior also within groups, we propose the following refinement to the algorithm, which we use in the experiments in §6.2 and §6.3: in each iteration, we train a classifier to predict the protected attribute not on the entire training set, but only on the training examples belonging to a single (randomly chosen) main-task class (e.g. profession). By doing so, we push the protected attribute to be linearly guarded in the examples belonging to each of the main-task labels.
6 Experiments and Analysis 6.1 "Debiasing" Word Embeddings In the first set of experiments, we evaluate the INLP method in its ability to debias word embeddings (Bolukbasi et al., 2016). After "debiasing" the embeddings, we repeat the set of diagnostic experiments of Gonen and Goldberg (2019).
Data. Our debiasing targets are the uncased version of GloVe word embeddings (Zhao et al., 2018), after limiting the vocabulary to the 150,000 most common words. To obtain labeled data X,Z for this classifier, we use the 7,500 most male-biased and 7,500 most female-biased words (as measured by the projection on the # » he − # » she direction), as well as 7,500 neutral vectors, with a small component (smaller than 0.03) in the gender direction. The data is randomly divided into a test set (30%), and training and development sets (70%, further divided into 70% training and 30% development examples).
Procedure We use a L 2 -regularized SVM classifier (Hearst et al., 1998) trained to discriminate between the 3 classes: male-biased, female-biased and neutral. We run Algorithm 1 for 35 iterations.

Results
Classification. Initially, a linear SVM classifier perfectly discriminates between the two genders (100% accuracy). The accuracy drops to 49.3% following INLP. To measure to what extent gender is still encoded in a nonlinear way, we train a 1layer ReLU-activation MLP. The MLP recovers gender with accuracy of 85.0%. This is expected, as the INLP method is only meant to achieve linear guarding 3 . Human-selected vs. Learned Directions. Our method differs from the common projection-based approach by two main factors: the numbers of directions we remove, and the fact that those directions are learned iteratively from data. Perhaps the benefit is purely due to removing more directions? We compare the ability to linearly classify words by gender bias after removing 10 directions by our method (running Algorithm 1 for 10 iterations) with the ability to do so after removing 10 manually-chosen directions defined by the difference vectors between gendered pairs 4 . INLPbased debiasing results in a very substantial drop in classification accuracy (54.4%), while the removal of the predefined directions only moderately decreases accuracy (80.7%). This shows that datadriven identification of gender-directions outperforms manually selected directions: there are many subtle ways in which gender is encoded, which are hard for people to imagine.
Discussion. Both the previous method and our method start with the main gender-direction of # » he − # » she. However, while previous attempts take this direction as the information that needs to be neutralized, our method instead considers the labeling induced by this gender direction, and then iteratively finds and neutralizes directions that correlate with this labeling. It is likely that the # » he − # » she direction is one of the first to be removed, but we then go on and learn a set of other directions that correlate with the same labeling and which are predictive of it to some degree, neutralizing each of them in turn. Compared to the 10 manually identified gender-directions from Bolukbasi et al. (2016), it is likely that our learned directions capture a much more diverse and subtle set of gender clues in the embedding space. Effect of debiasing on the embedding space. In appendix §A.2 we provide a list of 40 random words and their closest neighbors, before and after INLP, showing that INLP doesn't significantly damage the representation space that encodes lexical semantics. We also include a short analysis of the influence on a specific subset of inherently gendered words: gendered surnames (Appendix §A.4).
Additionally, we perform a semantic evaluation of the debiased embeddings using multiple word similarity datasets (e.g. SimLex-999 (Hill et al., 2015)). We find large improvements in the quality of the embeddings after the projection (e.g. on SimLex-999 the correlation with human judgements improves from 0.373 to 0.489) and we elaborate more on these findings in Appendix §A.3. Clustering. Figure 1 shows t-SNE (Maaten and Hinton, 2008) projections of the 2,000 most female-biased and 2,000 most male-biased words, originally and after t = 3, t = 18 and t = 35 projection steps. The results clearly demonstrate that the classes are no longer linearly separable: this behavior is qualitatively different from previous word vector debiasing methods, which were shown to maintain much of the proximity between female and male-biased vectors (Gonen and Goldberg, 2019). To quantify the difference, we perform K-means clustering to K = 2 clusters on the vectors, and calculate the V-measure (Rosenberg and Hirschberg, 2007) which assesses the degree of overlap between the two clusters and the gender groups. For the t-SNE projected vectors, the measure drops from 83.88% overlap originally, to 0.44% following the projection; and for the original space, the measure drops from 100% to 0.31%. WEAT. While our method does not guarantee attenuating the bias-by-neighbors phenomena that is discussed in Gonen and Goldberg (2019), it is still valuable to quantify to what extent it does mitigate this phenomenon. We repeat the Word Embedding Association Test (WEAT) from Caliskan et al. (2017) which aims to measure the association in vector space between male and female concepts and stereotypically male or female professions. Following Gonen and Goldberg (2019), we represent the male and female groups with common names of males and females, rather than with explicitly gendered words (e.g. pronouns). Three tests evaluate the association between a group of male names and a groups of female names to (1) career and family-related words; (2) art and mathematics words; and (3) artistic and scientific fields. In all three tests, we find that the strong association between the groups no longer exists after the projection (non-significant p-values of 0.855, 0.302 and 0.761, respectively).

Bias-by-Neighbors.
To measure bias-byneighbors as discussed in (Gonen and Goldberg, 2019), we consider the list of professions provided in (Bolukbasi et al., 2016) and measure the correlation between bias-by projection and bias by neighbors, quantified as the percentage of the top 100 neighbors of each profession which were originally biased-by-projection towards either of the genders. We find strong correlation of 0.734 (compared with 0.852 before), indicating that much of the bias-by-neighbors remains. 5

Fair Classification: Controlled Setup
We now evaluate using INLP with a deeper classifier, with the goal of achieving fair classification. Classifier bias measure: TPR-GAP. To measure the bias in a classifier, we follow De-Arteaga et al.
(2019) and use the TPR-GAP measure. This measure quantifies the bias in a classifier by considering the difference (GAP) in the True Positive Rate (TPR) between individuals with different protected attributes (e.g. gender/race). The TPR-GAP is tightly related to the notion of fairness by equal opportunity (Hardt et al., 2016): a fair classifier is expected to show similar success in predicting the task label Y for the two populations, when conditioned on the true class. Formally, for a binary protected attribute z and a true class y, define: where Z is a random variable denoting binary protected attribute, z and z denote its two values, and 5 Note that if, for example, STEM-related words are originally biased towards men, the word "chemist" after the projection may still be regarded as male-biased by neighbors, not because an inherent bias but due to its proximity to other originally biased words (e.g. other STEM professions). Y ,Ŷ are random variables denoting the correct class and the predicted class, respectively. Experiment setup. We begin by experimenting with a controlled setup, where we control for the proportion of the protected attributes within each main-task class. We follow the setup of Elazar and Goldberg (2018) which used a twitter dataset, collected by Blodgett et al. (2016), where each tweet is associated with "race" information and a sentiment which was determined by their belonging to some emoji group. Naturally, the correlation between the protected class labels and the main-class labels may influence the fairness of the model, as high correlation can encourage the model to condition on the protected attributes. We measure the TPR-GAP on predicting sentiment for the different race groups (African American English (AAE) speakers and Standard American English (SAE) speakers), with different imbalanced conditions, with and without application of our "classifier debiasing" procedure.
In all experiments, the dataset is overly balanced with respect to both sentiment and race (50k instances for each). We change only the proportion of each race within each sentiment class (e.g., in the 0.7 condition, the "happy" sentiment class is composed of 70% AAE / 30% SAE, while the "sad" class is composed of 30% AAE / 70% SAE).
Our classifier is based on the DeepMoji encoder (Felbo et al., 2017), followed by a 1-hideen-layer MLP. The DeepMoji model was trained on millions of tweets in order to predict their emojis; a model which was proven to perform well on different classification tasks (Felbo et al., 2017), but also encodes demographic information (Elazar and Goldberg, 2018). We train this classifier to predict sentiment. We then follow the procedure in §5: training a guarding function on the hidden layer of the MLP, and re-training the final linear layer on the guarded vectors.    As expected the TPR-GAP grows as we increase the correlation between class labels and protected attributes. The accuracy grows as well. Applying our debiasing technique significantly reduced the TPR gap in all settings, although hurting more the main task accuracy in the highly-imbalanced setting. In Appendix A.5, we give some more analysis on the balance between performance and TPR-Gap and show that one can control for this ratio, by using more iterations of INLP.

Fair Classification: In the Wild
We now evaluate the fair classification approach in a less artificial setting, measuring gender bias in biography classification, following the setup of De-Arteaga et al. (2019).
They scraped the web and collected a dataset of short biographies, annotated by gender and profession. They trained logistic regression classifiers to predict the profession of the biography's subject based on three different input representation: bag-of-words (BOW), bag of word-vectors (BWV), and RNN based representation. We repeat their experiments, using INLP for rendering the classifier oblivious of gender. Setup. Our data contains 393,423 biographies. 6 We follow the train:dev:test split of De-Arteaga et al. (2019) Table 2. INLP moderately changes main-task accuracy, with a 1.9% increase in BOW, a 5.1% decrease in performance in BWV and a 5.51% decrease in BERT. GAP T P R,RM S g is significantly decreased, indicating that on average, the true positive rate of the classifiers for male and female become closer: in BOW representation, from 0.203 to 0.124 (a 38.91% decrease); in BWV, from 0.184 to 0.089 (a 51.6% decrease); and in BERT, from 0.184 to 0.095 (a 48.36% decrease). We measure the correlation between GAP T P R y,f emale for each profession y, and the percentage of biographies of women in that profession. In BOW representation, the correlation decreases from 0.894 prior to INLP to 0.670 after it (a 33.4% decrease). In BWV representation, the correlation decreases from 0.896 prior to INLP to 0.425 after it (a 52.5% decrease). In BERT representation, the correlation decreases from 0.883 prior to INLP to 0.470 following it (a 46.7% decreases; Figure 4b). De-Arteaga et al.
(2019) report a correlation of 0.71 for BWV representations when using a "scrubbed" version of the biographies, with all pronouns and names removed. INLP significantly outperforms this baseline, while maintaining all explicit gender markers in the input. Analysis. How does imposing fairness influence the importance the logistic classifier attribute to different words in the biography? We take advantage of the BOW representation and visualize which features (words) influence each prediction (profession), before and after the projection. According to Algorithm 1, to debias an input x, we multiply W (P x). Equivalently, we can first multiply W by P to get a "debiased" weight matrix W . We begin by testing how much the debiased weights of words that are considered to be biased were changed during the debiasing, compared to random vocabulary words. We compare the relative change before and after the projection of these words, for every occupation. Biased words undergo an average relative change of x1.23 compared to the average change of the entire vocabulary, demonstrating that biased words indeed change more. The per-profession breakout is available in Figure 2 in Appendix §A.6.1.
Next, we test the words that were changed the most during the INLP process. We compare the weight difference before and after the projection. We sort each profession words by weight, and average their location index for each professions. Many words indeed seem gender specific (e.g. ms., mr., his, her, which appears in locations 1, 2, 3 and 4 respectively), but some seem unrelated, perhaps due to spurious correlations in the data. The complete list is available in Table 4 in the Appendix §A.6.1; an analogous analysis for the FastText representation is available at Appendix §A.6.2.

Limitations
A limitation of our method when used in the context of fairness is that, like other learning approaches, it depends on the data X,Z that is fed to it, and works under the assumption that the training data is sufficiently large and is sampled i.i.d from the same distribution as the test data. This condition is hard to achieve in practice, and failure to provide sufficiently representative training data may lead to biased classifications even after its application. Like other methods, there are no magic guarantees, and the burden of verification remains on the user. It is also important to remember that the method is designed to achieve a very specific sense of protection: removal of linear information regarding a protected attribute. While it may correlate with fairness measures such as demographic parity, it is not designed to ensure them. Finally, it is designed to be fed to a linear decoder, and the attributes are not protected under non-linear classifiers.

Conclusion
We present a novel method for removing linearlyrepresented information from neural representations. We focus on bias and fairness as case studies, and demonstrate that across increasingly complex settings, our method is capable of attenuating societal biases that are expressed in representations learned from data.
While we focused on bias, Iterative Nullspace Projection has broader possible use-cases, and can be utilized to remove specific components from a representation, in a controlled manner. This method can be applicable for other end goals, such as style-transfer, disentanglement of neural representations and increasing their interpretability. We aim to explore those directions in a future work.

A.1 INLP Guarantees
In this section, we prove, for the binary case, an orthogonality property for INLP classifiers: each two classifiers w i and w j from two iterations steps i and j are orthogonal (Lemma A.1). Several useful properties of the matrix P that is returned from INLP emerge as a direct result of orthogonality: the product of the projection matrices calculated in the different INLP steps is commutative (Corollary A.1.1); P is a valid projection (Corollary A.1.2); and P projects to a subspace which is the intersection of the nullspaces of all INLP classifiers N (w 1 ) ∩ N (w 2 ) ∩ · · · ∩ N (w n ) (Corollary A.1.3). Furthermore, we bound the influence of P on the structure of the representation space, demonstrating that its impact is limited only to those parts of the vectors that encode the protected attribute (Lemma A.2). We prove those properties for two consecutive projection matrices P 1 and P 2 from two consecutive iterations of Algorithm 1, presented below in 5. The general property follows by induction.
Lemma A.1. if w 2 is initialized as the zero vector and trained with SGD, and the loss L is convex, then w 2 is orthogonal to w 1 , that is, w 1 · w 2 = 0.
Proof. In line 4 of the algorithm, we calculate w 2 = argmin w L(w; X = P 1 X; Z). For a convex L and a linear model w, it holds that the gradient with respect to w is a linear function of x: ∇ w L(x i ) = α i x i for some scalar α i . It follows that after t stochastic SGD steps, w t 2 is a linear combination of input vectors x 1 , . . . , x i , . . . , x t . Since we constrain the optimization to x i ∈ N (w 1 ), and considering that fact the nullspace is closed under addition, at each step t in the optimization it holds that w t 2 ∈ N (w 1 ). In particular, this also holds for the optimal w * 2 7 .
We proceed to prove commutativity based on this property.
Note that in practice, we enforce Corollary A.1.3 by using the projection Equation 1 (section 4). As such, the matrix P that is returned from Algorithm 1 is a valid projection matrix to the intersection of the nullspaces even if the the conditions in Lemma A.1 do not hold, e.g. when L is nonconvex or w 2 is not initialized as the zero vector.

INLP Approximately Preserves Distances.
While the projection operations removes the protected information from the representations, ostensibly it could have had a detrimental impact on the structure of the representations space: as a trivial example, the zero matrix O is another operator that removes the protected information, but at a price of collapsing the entire space into the zero vector. The following lemma demonstrate this is not the case. The projection minimally damages the structure of the representation space, as measured by distances between arbitrary vectors: the change in (squared) distance between x, x ∈ R n is bounded by the difference between the "gender components" of x and x .
Lemma A.2. Let #» w ∈ R n be a unit gender direction found in one INLP iteration, and let x, x ∈ R n be arbitrary input vectors. Let P := P N ( #» w) be the nullspace projection matrix corresponding to #» w. Let d(x, x ) := ||x − x || and d P (x, x ) := ||P x − P x || be the distances between x, x before and after the projection, respectively. Then the following holds: Proof. notation: we denote the ith entry of a vector x by x i . Since #» w is the parameter vector of a gender classifier, a point x ∈ R n can be classified to a gender g ∈ {0, 1} according to the sign of the dot product x #» w. Note that in the binary case, the nullspace projection matrix P is given by Where #» w #» w T is the outer product. By definition, if #» w is in the direction of one of the axes, say without loss of generality the first axis, such that #» w = [1, 0, ..., 0], then the following holds: Such that #» w #» w T is the zero matrix except its (1, 1) entry, and then P is simplified to I.e, the unit matrix, except of a zero in the (1, 1) position. Hence, the projection operator P x keeps x intact, apart from zeroing the first coordinate x 1 . We will take advantage of this property, and rotate the axes such that #» w is the direction of the first axis. We will show that the results we derive this way still apply to the original axes system.
Let R be a rotation matrix, such that after the rotation, the first coordinate of Rx is aligned with #» w: One can always find such rotation of the axes. Let x ∈ R n be another point in the same space. Given the original squared distance between x and x : Our goal is to bound the squared distance between the projected points in the new coordinate system: Where [P ] R denotes the projection matrix P in the rotated coordinate system, which takes the form 7.
Note that R, being a rotation matrix, is orthogonal. By a known result in linear algebra, multiplication by orthogonal matrices preserves dot product and distances. That means that the distance is the same before and after the rotation: d P,R (x, x ) = d P (x, x ), so we can safely bound d P,R (x, x ) and the same bound would hold in the original coordinate system. By 7, Note that in general it holds that for any a ≥ b ≥ 0 Combining 12 with 11 when taking a = d( From 11 one can also trivially get Combining 14 and 13 we finally get: Or, equivalently, after subtracting d(x, x ) from all elements and multiplying by -1: Note that this result has a clear interpretation: the difference between the distance of the projected x, x and the distance of the original x, x is bounded by the difference of x and x in the gender direction #» w. In particular, if x and x are equally male-biased, their distance would not change at all; if x is very male-biased and x is very femalebiased, the projection would significantly alter the distance between them.  The test sets are composed of word pairs, where each pair was annotated by humans to give a similarity or relatedness score. To evaluate a model against such data, each pair is given a score (in the case of word embedding, cosine similarity) and then we calculate Spearman correlation between all the score pairs. The results on the regular Glove embeddings before and after the gender debiasing are presented in Table 3. We observe a major improvements across all evaluation sets after the projection: between 0.044 to 0.116 points.

A.2 Influence on Local Neighbors in Glove Space
This major difference in performance is rather surprising. It is not clear how to interpret the positive influence on correlation with human judgements. This puzzle is further compounded by the fact the projection reduces the rank of the embedding spaces, and by definition induces loss of information. We hypothesize that many of the words in the embedding space contain a significant gender component, which is not correlated with humans judgements of similarity. While intriguing, testing this hypothesis is beyond the scope of this work, and we leave the more rigorous answer to a future work.  The results in Table 1 suggest that, as expected, the projection has little influence on the lexical semantics of unbiased words, as measured by their closest neighbors in embedding space. But how does the projection influence inherently gendered words? Table 2 contains the closest-neighbors to the Glove representations of gendered surnames, before and after the projection. We observe an interesting tendency to move from neighbors which are other gendered surnames, towards family names, which are by definition gender-neutral (for instance, the closest neighbor of "Robert" changes from "Richard" to "Pattinson"). Another interesting tendency is to move towards place names bearing a connection to that surnames (For instance, the closest neighbor of "Sophia" changs to "Hagia"). At the same time, some gendered surnames remain close neighbors even after the projection.

A.5 Performance and "Fair Classification" as a Function of INLP Iterations
In Section 6.2 where we compare the accuracy and TPR-Gap before and after using INLP for a certain amount of iterations.  of their model. In Figure 1 we detail these tradeoffs for the 0.8 ratio, where the original TPR-Gap originally is the highest.
We note that the performance is minimally damaged for the first 180 iterations, while the TPR-Gap improves greatly, after-which, both metric account for larger drops. Using this trade-off, one can decide how much performance they are willing to sacrifice in order to get a less biased model.  In this section, we present the raw results of the experiment aimed to assess the influence of INLP on specific words, under the bag-of-words model, for the biographies experiments (Section 6.3.1). Table 4 lists the words most influenced by INLP projection (on average over all professions) after the debiasing procedure explained in Section 6.3. Figure 2 presents the relative change of biased word for each profession, compared to a random sample.

A.6.2 Bag-of-Word-Vectors Model
In this section, we present an analysis for the influence of INLP projection on the FastText representation of individual words, under the bag-of-wordvectors model, for the biographies experiments (Section 6.3.1). We begin by ordering the vocabulary items by their cosine similarity to each of the top 15 gender directions found in INLP (i.e., their similarity to the weight vector of each classifier). For each gender direction w i , we focus on the 20,000 most common vocabulary items, and calculate the closest words to w i (to get male-biased words) as well as the closest words to −w i (to get female-biased words). The result are presented in Table 5. The first gender direction seems to capture pronouns. Other gender directions capture socially biased terms, such as "preschool" (direction 10), "cookbooks" (direction 3) or other gender-related terms, such as "LGBTQ" (direction 15) or "femininity" (direction 6). Interestingly, those are mostly female-biased terms. As for the male-biased words, some directions capture surnames, such as "Gordon" and "Aviv". Other words which were found to be male-biased are less interpretable, such as words specifying years (direction 4), organizational terms such as "Organizers", "specifications" (direction 14), or the words "Papers", "Categories" (direction 7). It is not clear if those are the result of spurious correlations/noise, or whether they reflect actual subtle differences in the way the biographies of men and women are written.
Gender rowspace The above analysis focuses on what information do individual gender directions convey. Next, we aim to demonstrate the influence of the final INLP projection on the representation of words. To this end, we rely on the rowspace of the INLP matrix P . Recall that the rowspace is the orthogonal complement of the nullspace. As the INLP matrix P projects to the intersection of nullspaces of the gender directions, the complement P R := I − P projects to the union of rowspaces of individual gender directions. This is a subspace which is spanned by all gender direc-tions, and thus can be thought of as an empirical gender subspace within the representation space.
For a given word vector w, the "gender norm"the norm of its projection on the rowspace, ||P R w|| -is a scalar quantity which can serve as a measure for the gender-bias of the word. We sort the vocabulary by the ratio between the gender norm and the original norm, ||P R w|| ||w|| and present the 200 most gendered words (Table 6).