Dynamically Disentangling Social Bias from Task-Oriented Representations with Adversarial Attack

Representation learning is widely used in NLP for a vast range of tasks. However, representations derived from text corpora often reflect social biases. This phenomenon is pervasive and consistent across different neural models, causing serious concern. Previous methods mostly rely on a pre-specified, user-provided direction or suffer from unstable training. In this paper, we propose an adversarial disentangled debiasing model to dynamically decouple social bias attributes from the intermediate representations trained on the main task. We aim to denoise bias information while training on the downstream task, rather than completely remove social bias in pursuit of static unbiased representations. Experiments show the effectiveness of our method in terms of both the debiasing effect and main task performance.


Introduction
Supervised neural networks have achieved remarkable success in a wide range of natural language processing (NLP) tasks. The fundamental capability of these neural models is to learn effective feature representations (Bengio et al., 2013) for the downstream prediction task. Unfortunately, the learned representations frequently contain undesirable biases with respect to things that we would rather not use for decision making. We refer to such inappropriate factors as protected attributes (Elazar and Goldberg, 2018a). Biased information has serious real-world consequences. For example, concerns have been raised about automatic resume filtering systems giving preference to male applicants when the only distinguishing factor is the applicants' gender (Sun et al., 2019). In this paper, we focus on social bias, such as gender bias, which is the preference or prejudice towards one gender over the other (Moss-Racusin et al., 2012), race bias, and age bias. From the perspective of the debiasing target, previous debiasing works can be approximately classified into two types: word embedding (Bolukbasi et al., 2016; Caliskan et al., 2017; Zhao et al., 2018; Manzini et al., 2019; Wang et al., 2020; Kumar et al., 2020) and sentence embedding (Xu et al., 2017; Elazar and Goldberg, 2018a; Zhang et al., 2018; Ravfogel et al., 2020). The former aims to reduce the gender bias in word embeddings, either as a post-processing step (Bolukbasi et al., 2016) or as part of the training procedure (Zhao et al., 2018). The latter focuses on removing these protected attributes from the downstream intermediate representations (Elazar and Goldberg, 2018a; Ravfogel et al., 2020). In this paper, we consider the latter setting and focus on how to mitigate undesirable social bias from the encoded representations without hurting the performance of the main task.
In terms of debiasing methods, previous models are based either on projection onto a pre-specified, user-provided direction (Bolukbasi et al., 2016) or null-space (Xu et al., 2017; Ravfogel et al., 2020), or on adding an additional gender discriminator (Xie et al., 2017; Elazar and Goldberg, 2018a). The former first trains an intermediate feature extractor on the main task, then uses a separate projection method to remove social bias from the representations, and finally fine-tunes on the main task. The debiasing procedure can be regarded as static because there is no direct interaction between the main task and the debiasing task. Therefore, these methods offer no guarantee that the representations used to predict the main task contain no bias information. Gonen and Goldberg (2019) have shown that these methods only cover up the bias and that, in fact, the information is deeply ingrained in the representations. Compared to these static debiasing methods, gender-discriminator-based methods (Elazar and Goldberg, 2018a; Zhang et al., 2018) use the traditional generative adversarial network (GAN) (Goodfellow et al., 2014) to distinguish protected gender attributes from encoded representations. However, GANs are notoriously hard to train (Ganin and Lempitsky, 2015). Elazar and Goldberg (2018a) have shown that the complete removal of the protected information is nontrivial: even when the attribute seems protected, different classifiers of the same architecture can often still succeed in extracting it. Hence, we aim to dynamically disentangle the social bias from the encoded representations while jointly training on the main task in a more stable way, rather than directly removing protected attributes. In fact, we show that bias information always remains even after adversarial debiasing and can be reconstructed from the encoded representations.
The main goal of debiasing is to prevent downstream models from utilizing these social biases in the representations, that is, dynamic disentanglement instead of complete removal, as Fig 1 displays. In this paper, we propose an adversarial disentangled debiasing model to dynamically decouple social bias attributes from the intermediate representations trained on the main task. Our motivation is to denoise bias information while training on the downstream task, rather than completely remove social bias in pursuit of static unbiased representations. Previous works (Elazar and Goldberg, 2018a; Gonen and Goldberg, 2019) show that even when debiasing models achieve high fairness (Hardt et al., 2016), a fair amount of protected information still remains and can be extracted from the encoded representations. We argue that one can hardly remove all gender or race directions in the latent space; one can only preserve bias-free prediction on the downstream task. Specifically, we use a protected attribute classifier to generate model-agnostic adversarial worst-case perturbations to the representations in the direction that significantly increases the classifier's loss. Then we apply the perturbations to train the model of the downstream task end-to-end. The main difference between our method and GAN-based counterparts is that GANs suffer from unstable training due to the two-stage min-max procedure, while our method directly computes gradient-based perturbations to disentangle bias information from the representations. We hope to provide new insights and directions towards solving social bias issues.

Approach

Problem Formulation
Our main goal is to disentangle protected attributes from the representations of downstream tasks so that biased information cannot affect the decision of the model on the main task. In other words, we aim to achieve fairness by equalizing the opportunity (Hardt et al., 2016) between individuals with different protected attributes (e.g. gender/race). Given a set of input samples x_i and corresponding discrete protected attributes z_i ∈ {1, . . . , k} (e.g. gender or race), we aim to learn unbiased representations h_i ∈ R^d, so that z_i has as little negative effect as possible on the main task performance.

Overall Architecture
Fig 2 shows the overall architecture of our proposed method, including four core steps: protected forward, debiasing backward, main task forward, and update parameters.
(1) protected forward: We first pre-train a protected attribute classifier, then compute the classification cross-entropy loss L_protected for each input sample x.
(2) debiasing backward: We maximize the loss L_protected of the protected attribute classifier to obtain the adversarial decoupling perturbation δ.
(3) main task forward: We then sum the original input x and the perturbation δ to get a new adversarial sample x_adv, and forward x_adv to the main task classifier to compute the loss L_main of the downstream task.
(4) update parameters: Finally, the overall model is updated by the sum of the two losses L_protected and L_main.
We dive into the details of each procedure in the following section.
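The four steps can be sketched end-to-end on a toy linear stand-in (a minimal illustration under stated assumptions: the real model uses a BiLSTM encoder and MLP classifiers, and `train_step`, `softmax`, and all shapes here are our own illustrative naming, not the paper's code):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def train_step(x, y_main, z_prot, main_W, prot_W, eps=0.5, lr=0.1):
    """One debiasing training step on a toy linear model.

    x: (d,) input representation; y_main: main task label;
    z_prot: protected attribute label; main_W, prot_W: linear classifiers.
    """
    k_p, k_m = prot_W.shape[0], main_W.shape[0]
    # (1) protected forward: cross-entropy loss of the protected classifier
    p = softmax(prot_W @ x)
    loss_prot = -np.log(p[z_prot])
    # (2) debiasing backward: gradient direction that increases L_protected (FGV)
    g = (p - np.eye(k_p)[z_prot]) @ prot_W
    delta = eps * g / (np.linalg.norm(g) + 1e-12)
    # (3) main task forward: classify the perturbed (debiased) sample
    x_adv = x + delta
    q = softmax(main_W @ x_adv)
    loss_main = -np.log(q[y_main])
    # (4) update parameters: a plain SGD step on the main classifier
    main_W -= lr * np.outer(q - np.eye(k_m)[y_main], x_adv)
    return loss_prot, loss_main
```

Repeating `train_step` on a sample drives the main-task loss down while the input the main classifier actually sees stays adversarial to the protected classifier.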

Adversarial Semantic Disentanglement
Protected Forward In Fig 2, we adopt a BiLSTM as the context encoder shared by the main task classifier and the protected attribute classifier. We first feed each token to an embedding layer to get its token embedding e; then the BiLSTM encoder produces a context-aware representation h_i for each token x_i. Next, we use an attentive pooling layer to calculate the sentence embedding h. After that, a fully-connected layer followed by a softmax output layer is used to predict the protected attribute ŷ_i. Finally, we obtain the classification cross-entropy loss L_protected. In our experiments, we observe that pre-training the protected attribute classifier can effectively accelerate the whole debiasing training process. We also demonstrate that jointly training the protected attribute classifier and the main task classifier achieves superior performance in Section 4.2.
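The attentive pooling step, which collapses per-token vectors into a single sentence embedding h, can be sketched as follows (a minimal NumPy illustration; the scoring vector `w` and the function name are our assumptions, and the BiLSTM encoder is omitted):

```python
import numpy as np

def attentive_pooling(H, w):
    """Collapse token representations H (n_tokens, d) into one sentence
    embedding via attention weights scored by the (d,) vector w."""
    scores = H @ w                        # one relevance score per token
    scores = scores - scores.max()        # stabilize the softmax
    a = np.exp(scores)
    a = a / a.sum()                       # attention distribution over tokens
    return a @ H                          # (d,) weighted sum of token vectors
```

With a zero scoring vector the attention is uniform and the layer reduces to mean pooling, which is a convenient sanity check.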
Debiasing Backward This is the primary step of our adversarial semantic disentanglement. Our main idea is to perform adversarial attacks (Goodfellow et al., 2015; Kurakin et al., 2016; Miyato et al., 2016; Jia and Liang, 2017; Zhang et al., 2019; Ren et al., 2019) to dynamically decouple social bias attributes from the intermediate representations trained on the main task. Specifically, we need to compute a worst-case perturbation δ that maximizes the original classification cross-entropy loss L_protected of the protected attribute classifier:

δ = argmax_{‖δ‖ ≤ ε} L_protected(f(x + δ; θ))

where θ represents the parameters of the protected attribute classifier, x denotes a given sample, and ε is the norm bound of the perturbation δ. However, due to model complexity, accurate computation of δ is costly and inefficient. Similar to Vedula et al. (2020) and Ru et al. (2020), we apply Fast Gradient Value (FGV) (Rozsa et al., 2016) to approximate the worst-case perturbation δ:

g = ∇_x L_protected(f(x; θ)),   δ = ε · g / ‖g‖_2

where f represents the protected attribute classifier. We normalize g and then use a small ε to ensure the approximation is reasonable. Section 4.3 validates that a proper value of ε can balance the debiasing effect and the main task performance. Finally, we obtain the pseudo adversarial sample x_adv = x + δ. Intuitively, we aim to obtain a debiased representation x_adv by confusing the protected attribute classifier, so that the main task classifier can make a fair decision conditioned on the disentangled representation.
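For a linear softmax classifier, the FGV perturbation can be computed in closed form (a hedged sketch: the paper's protected classifier is an MLP over BiLSTM features, so there the input gradient comes from backpropagation rather than this analytic expression):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fgv_perturbation(x, y, W, b, eps):
    """Fast Gradient Value: move x in the direction that most increases
    the protected classifier's cross-entropy loss.

    x: (d,) input; y: ground-truth protected attribute index;
    W: (k, d), b: (k,) linear classifier; eps: L2 norm of the result.
    """
    p = softmax(W @ x + b)                 # predicted attribute distribution
    onehot = np.eye(len(p))[y]
    # dL/dlogits = p - onehot; dlogits/dx = W, so dL/dx = (p - onehot) @ W
    g = (p - onehot) @ W
    return eps * g / (np.linalg.norm(g) + 1e-12)   # normalize, scale by eps
```

Because the cross-entropy of a linear softmax classifier is convex in its input, stepping along the normalized gradient is guaranteed to increase L_protected, i.e. to confuse the protected classifier.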
Main Task Forward After obtaining the pseudo adversarial sample x_adv, we forward it to the main task classifier to compute the loss L_main of the downstream task, similar to the protected forward step. We find that the location at which the adversarial perturbation is added plays a role in debiasing performance (Section 4.4). In a nutshell, adding the noise to the word embedding layer achieves the best debiasing performance.
Update Parameters Finally, we apply the two classification objectives to update the parameters of the model, as the dashed lines in Fig 2 show. Note that the loss L_protected of the protected attribute classifier only updates the MLP and softmax layers, while the loss L_main of the main task classifier updates all the model parameters, including the low-level encoding layers. This setting aims to avoid a negative effect of the protected attribute classifier on main task performance.

Setup
Datasets Following the setup of Ravfogel et al. (2020), we test the performance of our debiasing method on a Twitter sentiment dataset, in which each tweet contains "race" information and emojis correspond with specific emotion groups, and on a biography corpus. According to the labels of race and sentiment, we split the Twitter data into four classes: African American English (AAE) speaker with "happy" sentiment, Standard American English (SAE) speaker with "happy" sentiment, AAE speaker with "sad" sentiment, and SAE speaker with "sad" sentiment. Following Elazar and Goldberg (2018b), we filter the corpus, leaving 176K tweets (44k for each class). We then divide them into 40k samples for training, 2k for development, and 2k for testing, following Ravfogel et al. (2020). In the controlled setup, we introduce a bias ratio relating sentiment and race to control the imbalance of samples across the four groups, following Ravfogel et al. (2020). For example, in the 0.8 condition, the AAE class contains 80% happy / 20% sad samples, while the SAE class contains 80% sad / 20% happy samples; in the 0.5 condition, all four categories contain the same number of samples. In all experiments, the imbalance factor of the development and test sets is set to 0.5. The biography corpus contains 393,423 biographies with the corresponding profession labels (28 classes) and gender (protected attribute) labels. We split the dataset into 255,710, 39,369, and 98,344 samples for training, validation, and testing, consistent with De-Arteaga et al. (2019) and Ravfogel et al. (2020).
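The bias-ratio subsampling of the controlled setup can be sketched as follows (illustrative only; `controlled_split` and the group keys are our own naming, not the released preprocessing code):

```python
import random

def controlled_split(groups, n_per_class, ratio, seed=0):
    """Subsample a sentiment corpus so race and sentiment correlate with
    a given bias ratio.

    groups: dict mapping (race, sentiment) -> list of samples,
            race in {"AAE", "SAE"}, sentiment in {"happy", "sad"}.
    ratio:  e.g. 0.8 -> AAE is 80% happy / 20% sad, SAE the reverse.
    """
    rng = random.Random(seed)
    quota = {
        ("AAE", "happy"): round(n_per_class * ratio),
        ("AAE", "sad"):   round(n_per_class * (1 - ratio)),
        ("SAE", "happy"): round(n_per_class * (1 - ratio)),
        ("SAE", "sad"):   round(n_per_class * ratio),
    }
    return {k: rng.sample(groups[k], quota[k]) for k in quota}
```

At ratio 0.5 the four quotas coincide, reproducing the balanced condition used for the development and test sets.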
Baselines We compare our model with the following baselines:
• Original is the main task classifier without any debiasing procedure.
• INLP (Ravfogel et al., 2020) is a linear debiasing method, which removes the protected information from neural representations by iteratively training linear classifiers that predict the protected attributes.
• Random Noise replaces the debiasing perturbation generated by the protected classifier with random noise.

Implementation Details
To demonstrate the effectiveness of our method, we use the same model structure for the main task (sentiment classification) as Ravfogel et al. (2020), where the DeepMoji encoder (Felbo et al., 2017) and a one-hidden-layer MLP constitute the classifier. For simplicity, we use the same classifier structure for predicting protected attributes. Both the unbalanced training data and the pre-trained DeepMoji model, which has been shown to encode demographic information, lead the downstream MLP classifier to make biased predictions. We then perform debiasing training for the main-task model, following the process described in Section 2.3, on the imbalanced training set with a given imbalance factor, and test the debiased model on the balanced test set.
In addition, we follow Ravfogel et al. (2020) in evaluating our debiasing method on the biography corpus as a wild setup, to verify the validity of our method in a less artificial setting. In this wild setup, we construct a model structure similar to the DeepMoji encoder, with a two-layer bidirectional RNN as the encoder, except without the attention operation. There are two input representation types for the encoder: FastText and BERT (Devlin et al., 2019). In the FastText experiments, we directly use the trained word embeddings provided by Ravfogel et al. (2020) to represent each biography as a sequence of vectors. In the BERT experiments, we use BERT as a sequence-to-sequence encoder to obtain the representation of each word in the sentence. We then feed the sentence representations into the model and perform the debiasing training process.
For all the experiments, we train and test our model on a single 2080Ti GPU, and we use the AllenNLP framework (Gardner et al., 2017) to implement our model. The hidden size of the one-hidden-layer MLP classifier used in all of the above experiments is set to 300. In the controlled experiment, our debiasing method takes an average of ten minutes to run, and the total number of parameters of our models is 23M, including a DeepMoji encoder, a main task classifier, and a protected classifier. In the wild experiment, the model of the FastText experiment has 127M parameters and takes an average of 15 minutes to run, while the model of the BERT experiment has 114M parameters and takes an average of 55 minutes to run, due to the use of BERT to encode the sentences. It is worth mentioning that our method converges within only one or two epochs, which is faster than other debiasing methods. In practice, we empirically find that the debiasing performance is best when the L2 norm of the perturbation is between 1/3 and 2/3 of the L2 norm of the corresponding disturbed vectors. For example, in the first experiment, the L2 norm of the embedding vector is around 4, so we could set the normalized scale to (1.2, 1.8).
Metrics To evaluate the bias in the model, following Ravfogel et al. (2020) and De-Arteaga et al. (2019), we calculate the TPR-GAP, which measures the difference (GAP) in the True Positive Rate (TPR) between groups with different protected attributes and can reflect the unfairness existing in NLP models:

GAP_TPR(p, y) = TPR(p, y) − TPR(p′, y)

where y is the main task label of the input representation X, p and p′ denote the two values of the protected attribute P, and TPR(p, y) is the probability of correctly predicting label y for individuals with protected attribute p. We then use the TPR-GAP to measure the degree of bias by calculating the root mean square of GAP_TPR(p, y) over all main task labels y:

GAP_TPR,RMS = sqrt( (1/|N|) Σ_{y∈N} GAP_TPR(p, y)² )

where N is the label set of the main task (sentiment or profession). Such gaps have a strong correlation with the percentage of a certain gender group in each profession y; therefore GAP_TPR,RMS can reflect an overview of bias across all main task labels. We use GAP_TPR,RMS to measure the bias existing in the models.

Results Table 1 displays the experimental results on the DIAL dataset under different ratios of data imbalance, which reflect the degree of dataset bias. We analyze the results from two perspectives: TPR-GAP (Debias) and Sentiment (Main Task). For TPR-GAP (Debias), our method consistently outperforms the other baselines under all ratios, especially on the more biased datasets, demonstrating the effectiveness of our proposed adversarial semantic disentanglement. We also observe that Random Noise can hardly mitigate social bias, which confirms the necessity of the protected attribute classifier. For the performance of the main sentiment classification task, our method stays close to the original baseline, while INLP suffers a severe drop under large ratios. These results show that our method better avoids a negative effect of the debiasing procedure on main task performance. To further evaluate the debiasing effect, we also show the results on the wild biography classification dataset in Table 2.
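The GAP_TPR,RMS metric can be computed directly from model predictions (a sketch; function names are our own):

```python
import numpy as np

def tpr(y_true, y_pred, prot, p, y):
    """True positive rate for main-task label y within protected group p."""
    mask = (prot == p) & (y_true == y)
    return (y_pred[mask] == y).mean()

def tpr_gap_rms(y_true, y_pred, prot, labels, p0, p1):
    """Root mean square of the per-label TPR gap between groups p0 and p1."""
    gaps = [tpr(y_true, y_pred, prot, p0, y) - tpr(y_true, y_pred, prot, p1, y)
            for y in labels]
    return float(np.sqrt(np.mean(np.square(gaps))))
```

A perfectly fair model has equal TPRs for both groups on every label and thus a GAP_TPR,RMS of zero.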
The results show that our method achieves superior performance over the other baselines on both the accuracy of the main task and the TPR-GAP of debiasing. Compared to the significant improvements on the DIAL dataset, the gains here are smaller; we hypothesize that the bias degree of the dataset affects the range of improvement.

Fixed Encoder vs. Non-fixed Encoder
In previous works, it is common to pre-train the sentence encoder in advance and keep the encoder fixed while applying the debiasing algorithm. However, it is unclear whether this conventional experimental setup is applicable to our approach. Since our approach dynamically generates perturbations to decouple social bias from context via adversarial attacks, we expect a non-fixed encoder to yield perturbations of higher quality. To check this, we conduct two groups of experiments on the DIAL dataset, where one group uses a fixed encoder while the other keeps the contextual encoder trainable. Note that we set the bias ratio to 0.6 in both groups of experiments.
Fig 3 shows the experimental results. In the upper part of Fig 3, we observe that our approach with the non-fixed encoder consistently achieves better debiasing effectiveness than the fixed-encoder counterpart, by a large margin. As the perturbation intensity increases, both experimental settings achieve an increasingly better debiasing effect.
On the other hand, as shown in the lower part of Fig 3, the fixed-encoder approach suffers a severe drop in classification accuracy with increasing perturbation intensity. Meanwhile, the classification accuracy under the non-fixed encoder setting keeps increasing, and even outperforms the fixed-encoder one when a relatively large perturbation intensity is applied. We argue that, with a non-fixed encoder, our approach can learn a high-quality perturbation for representation debiasing while continuously optimizing for the main task.

Protected Classifier: "Static" vs. Training on-the-fly
As discussed in the previous section, our proposed adversarial disentangled debiasing method requires the protected classifier to learn an accurate decision boundary for the protected attributes, such that the debiasing perturbation approximates the direction that most effectively eliminates the model's discrimination of the protected attributes. Naturally, we have two options: either fix the parameters of the protected classifier to generate a relatively static debiasing perturbation, or train the protected classifier on-the-fly during the main classifier's training process to offer a relatively dynamic perturbation. To verify which one performs better, we conduct two groups of experiments. In the "static" setting, we keep the parameters of the protected classifier fixed; whether or not the parameters of the encoder are fixed, the debiasing perturbation generated by the protected classifier is then relatively static (and if the parameters of the encoder are also fixed, the debiasing perturbation is totally static). In the training on-the-fly setting, we retain the gradient of the protected classifier and update its parameters together with the main task model (context encoder and main task classifier). According to the conclusions in Section 4.1, we make the context encoder trainable in both settings and use the same objective to train the main classifier.

Figure 5: The debiasing effectiveness (above) and the classification accuracy on the main task (below) of our proposed approach on the DIAL dataset, as the perturbation intensity increases from 0.1 to 7.0. We set the bias ratio to 0.8 and all parameters trainable.
The results are displayed in Fig 4. We find that both settings are able to debias on the DIAL dataset, showing the effectiveness of our approach in both settings. However, the training on-the-fly strategy consistently outperforms the "static" strategy under various perturbation intensities. We hypothesize that the difference arises mainly because, under the training on-the-fly strategy, the protected classifier has a chance to adjust its decision boundary when the context encoder updates, and thus continuously generates better dynamic debiasing perturbations via adversarial attacks.

Influence of Perturbation Intensity
To explore how the perturbation intensity influences the debiasing effectiveness and the performance of the main task, we run multiple experiments, changing only the perturbation intensity. We experiment with a wide range of perturbation intensities, from 0.1 to 7.0.
The experimental results are illustrated in Fig 5. From the upper part of the figure, we find that the bias decrement increases rapidly in the beginning, as the intensity increases from 0.1 to 0.7. Then, over a wide range from 0.7 to 6.6, the bias decrement stays relatively stable, oscillating within a small range of 0.275-0.325, reflecting the stability of our approach. However, when the perturbation intensity exceeds some threshold (6.6 in this case), the bias decrement drops again. Meanwhile, as the perturbation intensity increases, the classification accuracy of the main task keeps falling (lower part of the figure), indicating that a high-intensity perturbation also disturbs the main task, leading to low classification accuracy. The result suggests a principle for choosing a suitable perturbation intensity: the minimal intensity that is effective enough for debiasing.

Table 3: Analysis of which representation space is best for debiasing. "To sent emb" indicates the perturbation is added to the sentence embedding space, while "To word emb" indicates the perturbation is added to the word embedding space. The perturbation intensity is set to 0.7.

Which Representation Space to Apply Debiasing
Another pivotal consideration for our dynamically disentangling approach is which representation space we should add the perturbation to. Typically, we have two choices: a) adding the perturbation to the sentence embedding space, or b) adding the perturbation to the word embedding space. The sentence embedding is closer to the output space, with the key information condensed into a single vector, while the word embedding is closer to the input side, keeping a separate vector for each token. To determine which performs better for social debiasing, we conduct experiments on the DIAL dataset with different bias ratios. Table 3 illustrates the experimental results. Comparing "To sent emb" with "To word emb", we find that adding the perturbation to the word embedding space often gives better debiasing results, especially when the bias ratio of the dataset is large. For example, when the bias ratio is 0.8, adding to the word embedding space achieves a GAP_TPR,RMS of 0.09, while adding to the sentence embedding space achieves 0.21. We believe that, when applying our debiasing approach to a deeper representation space, the perturbation is also context-aware (since the context encoder is involved when calculating the gradient) and thus more dynamic for the complex data distribution.

Table 4: Experimental results on accuracy and debiasing effect with different objectives of the protected classifier. We respectively apply the cross-entropy loss and the entropy loss to the protected classifier when calculating the objective for generating the debiasing perturbation. Note that the protected classifier is pre-trained and fixed, and the entropy loss does not require ground-truth protected attributes during the training of the main task.

Cross-Entropy vs. Entropy
As mentioned in Section 2.3, we need to calculate a cross-entropy loss L_protected to generate the debiasing perturbation via FGV. Thus, during the training of the main task, we must obtain the protected attribute of each training example to calculate the cross-entropy loss. This severely limits the usefulness of our approach, as it may be difficult to obtain the ground-truth protected attribute when training the main task. To this end, we also propose to use the entropy loss (Zheng et al., 2020) as a substitute for the cross-entropy loss:

L_protected = H(P(y_protected | x)) = − Σ_y P(y | x) log P(y | x)

where H indicates the Shannon entropy and P(y_protected | x) is the distribution output by the protected classifier. This objective pushes the protected classifier toward high entropy, meaning the classifier is not confident and its prediction is distributed almost uniformly across all values of the protected attributes.
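The entropy objective is straightforward to compute from the classifier's output distribution (a sketch; `entropy_loss` is our own name):

```python
import numpy as np

def entropy_loss(probs):
    """Shannon entropy of the protected classifier's output distribution.
    Maximizing this via FGV (instead of the cross-entropy, which needs
    gold protected labels) pushes the classifier toward a uniform,
    unconfident prediction -- no ground-truth attribute required."""
    p = np.clip(probs, 1e-12, 1.0)        # guard against log(0)
    return float(-(p * np.log(p)).sum())
```

The entropy is maximal (log k for k attribute values) on the uniform distribution and near zero on a confident one-hot prediction, so maximizing it drives the classifier toward indifference.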
In Table 4, we compare the debiasing effectiveness of the entropy objective with that of the cross-entropy objective. From the table, we observe that the entropy objective also works for debiasing, as the TPR-GAP drops compared with the baseline. However, its debiasing effect still does not match our approach with cross-entropy. This seems reasonable, since the cross-entropy objective introduces extra information about the protected attribute; with this extra supervision signal, our approach generates perturbations in a more precise direction for eliminating the representation of the protected attributes.

Performance on Different Bias Ratio
To show more clearly the performance differences of our model over datasets with varying degrees of bias, we introduce a new metric named the Relative Improve Metric (RIM), defined in terms of Acc and Acc′, the main task accuracy of the model before and after debiasing, and GAP and GAP′, the TPR-GAP indicator of the model before and after debiasing. RIM synthetically reflects both the stability of the main task and the debiasing performance of a debiasing method. We calculate the RIM indicator of our model and of INLP based on the results in Table 1; the results are shown in Table 5. We observe that the stronger the bias in the dataset, the better the performance of the two methods. Moreover, our debiasing method is more robust.
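Since the exact RIM formula is not recoverable from this text, the following is only one plausible instantiation consistent with the description, combining accuracy retention with relative GAP reduction (explicitly an assumption of ours, not the paper's definition):

```python
def rim(acc_before, acc_after, gap_before, gap_after):
    """A hypothetical Relative Improve Metric (RIM): rewards keeping the
    main task accuracy (acc_after / acc_before) while shrinking the
    TPR-GAP ((gap_before - gap_after) / gap_before). ASSUMED form."""
    acc_retention = acc_after / acc_before
    gap_reduction = (gap_before - gap_after) / gap_before
    return acc_retention * gap_reduction
```

Under this reading, a method that halves the GAP without losing any accuracy scores 0.5, while one that leaves the GAP unchanged scores 0 regardless of accuracy.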

Visualization
To better understand the effectiveness of our method, we display a feature visualization of the sentence representations in Fig 6. We can observe that the different race classes are no longer linearly separable after debiasing. Therefore, downstream tasks cannot make decisions conditioned on the race information in the representations.

Related Work
The objective of controlled removal of specific types of information from neural representations is tightly related to the task of disentanglement of representations (Bengio et al., 2013), that is, controlling and separating the different kinds of information encoded in them. Previous models are based either on projection onto a pre-specified, user-provided direction (Bolukbasi et al., 2016) or null-space (Xu et al., 2017; Ravfogel et al., 2020), on adding an additional gender discriminator (Xie et al., 2017; Elazar and Goldberg, 2018a), or on the impact of data decisions (Beutel et al., 2017). The former first trains an intermediate feature extractor on the main task, then uses a separate projection method to remove social bias from the representations, and finally fine-tunes on the main task. Compared to these static debiasing methods, gender-discriminator-based methods (Elazar and Goldberg, 2018a; Zhang et al., 2018) use the traditional generative adversarial network (GAN) (Goodfellow et al., 2014) to remove protected gender attributes from encoded representations. However, GANs are notoriously hard to train (Ganin and Lempitsky, 2015). Elazar and Goldberg (2018a) have shown that the complete removal of the protected information is nontrivial: even when the attribute seems protected, different classifiers of the same architecture can often still succeed in extracting it. Therefore, in this paper, we aim to dynamically disentangle the social bias from the encoded representations while jointly training on the main task in a more stable way, rather than directly removing protected attributes. The main goal of debiasing is to prevent downstream models from utilizing these social biases in the representations, that is, dynamic disentanglement instead of complete removal.

Conclusion
In this paper, we focus on removing social bias in representation learning. We argue that the main goal of debiasing is to prevent downstream models from utilizing these social biases in the representations, that is, dynamic disentanglement instead of complete removal. Therefore, we propose an adversarial disentangled debiasing model to dynamically decouple social bias attributes from the intermediate representations trained on the main task. We perform extensive experiments and analyses to demonstrate the effectiveness of our method. We hope to provide new insights and directions towards solving social bias.

Broader Impact
In recent years, neural-network-based models have demonstrated remarkable performance in many natural language processing tasks and have thus been applied to a wide range of real-world applications. However, many works reveal that such models are easily affected by social bias and thus make incorrect and biased decisions. In domains with the greatest potential for societal impact, using such biased models in real-world applications is dangerous and raises serious ethical concerns. The social bias implicit in natural language processing models may be exposed and become a socially destabilizing factor. Meanwhile, some existing debiasing methods, although able to slightly reduce bias in such models, often cause great damage to model performance on the main task and are thus difficult to apply in practice. This work proposes a new adversarial training method for end-to-end debiasing. Due to the robustness of the adversarial attack, the model can eliminate bias without losing much performance.