On Transferability of Bias Mitigation Effects in Language Model Fine-Tuning

Fine-tuned language models have been shown to exhibit biases against protected groups in a host of modeling tasks such as text classification and coreference resolution. Previous works focus on detecting these biases, reducing bias in data representations, and using auxiliary training objectives to mitigate bias during fine-tuning. Although these techniques achieve bias reduction for the task and domain at hand, the effects of bias mitigation may not directly transfer to new tasks, requiring additional data collection, customized annotation of sensitive attributes, and re-evaluation of appropriate fairness metrics. We explore the feasibility and benefits of upstream bias mitigation (UBM) for reducing bias on downstream tasks, by first applying bias mitigation to an upstream model through fine-tuning and subsequently using it for downstream fine-tuning. In extensive experiments across hate speech detection, toxicity detection, and coreference resolution tasks over various bias factors, we find that the effects of UBM are indeed transferable to new downstream tasks or domains via fine-tuning, creating less biased downstream models than directly fine-tuning on the downstream task or transferring from a vanilla upstream model. Though challenges remain, we show that UBM promises more efficient and accessible bias mitigation in LM fine-tuning.


Introduction
The practice of fine-tuning pretrained language models (PTLMs or LMs), such as BERT (Devlin et al., 2019), has improved prediction performance in a wide range of NLP tasks.

Code and data: https://github.com/INK-USC/Upstream-Bias-Mitigation. The work was partially done when Xisen Jin was an intern at Snap Inc.

Figure 1: Comparison between the focus of our study (d) and previous works (a, b, c). We study the viability of obtaining an upstream model that could reduce bias in a number of downstream classifiers when fine-tuned.

However, fine-tuned LMs may exhibit biases against certain protected groups (e.g., gender and ethnic minorities),
as models may spuriously learn to associate certain features with positive or negative labels (Dixon et al., 2018), or propagate bias encoded in PTLMs to downstream classifiers (Caliskan et al., 2017; Bolukbasi et al., 2016). Among many examples, Kurita et al. (2019) demonstrate gender bias in the pronoun resolution task when models are trained using BERT embeddings, and Kennedy et al. (2020) show that hate speech classifiers fine-tuned from BERT yield more frequent false positive predictions for certain group identifier mentions (e.g., "muslim", "black"). Approaches for bias mitigation are mostly applied during fine-tuning to reduce bias in a specific downstream task or dataset (Park et al., 2018; Zhang et al., 2018; Beutel et al., 2017) (see Fig. 1 (a)). For example, data augmentation approaches reduce the influence of spurious features in the original dataset (Dixon et al., 2018; Zhao et al., 2018; Park et al., 2018), and adversarial learning approaches generate debiased data representations that are exclusive to the downstream model (Kumar et al., 2019; Zhang et al., 2018). These techniques act on biases particular to the given dataset, domain, or task, and require new bias mitigation when switching to a new downstream task or dataset. This can require auxiliary training objectives, the definition of task-specific fairness metrics, the annotation of bias attributes (e.g., identifying African American Vernacular English), and the collection of users' demographic data. These drawbacks make bias mitigation inaccessible to the growing community fine-tuning LMs on new datasets and tasks.
In contrast, we investigate initially mitigating bias while fine-tuning an "upstream" model on one or more upstream datasets, and subsequently achieving reduced bias when fine-tuning for downstream applications (Fig. 1 (d)), so that bias mitigation is no longer required in downstream training. Similar to transfer learning for enhancing predictive performance in common setups (Pan and Yang, 2010; Dai and Le, 2015), we suggest that LMs that undergo bias mitigation acquire inductive bias that helps reduce harmful biases when fine-tuned on new domains and tasks. In four tasks with known bias factors (hate speech detection, toxicity detection, occupation prediction from short bios, and coreference resolution), we explore whether upstream bias mitigation of a LM followed by downstream fine-tuning reduces bias in the downstream model. Though previous works have addressed biases in frozen PTLMs or word embeddings (Bolukbasi et al., 2016; Zhou et al., 2019; Bhardwaj et al., 2020; Liang et al., 2020; Ravfogel et al., 2020), for example by measuring associations between gender and occupations in an embedding space, they do not study their effect on downstream classifiers (Fig. 1 (b)), while others study the effects while keeping the embeddings frozen (Kurita et al., 2019; Prost et al., 2019). Bias in these frozen representations can also be directly corrected by removing associations between features and sensitive attributes (Elazar and Goldberg, 2018; Madras et al., 2018) (Fig. 1 (c)), but this does not allow predictions to be generated for new data.
Our experiments address the following research questions: (a) whether mitigation of a single bias factor in the upstream stage is maintained when fine-tuning on new examples from the same domain and task; (b) whether transfer is viable when the downstream domains and tasks differ from the upstream model's; and (c) whether we can address multiple kinds of bias with a single upstream model. We perform these experiments under a generic transfer learning framework, referred to as Upstream Bias Mitigation (UBM) for Downstream Fine-Tuning for convenience, which consists of two stages: first, in the upstream bias mitigation stage, a LM is fine-tuned with bias mitigation objectives on one or several "upstream" tasks, and subsequently the classification layer is re-initialized; then, in the downstream fine-tuning stage, the encoder from the upstream model, jointly with the new classification layer, is again fine-tuned on a downstream task without additional bias mitigation steps. Using six datasets with previously recognized bias factors, our analysis shows overall positive results for the questions above; still, challenges remain in stabilizing the results of bias mitigation in difficult setups, e.g., the multi-bias-factor setting.
Our contributions are summarized as follows: (1) we propose a new research direction for mitigating bias in fine-tuned models; (2) we perform extensive experiments to study the viability of the upstream bias mitigation framework in various settings; (3) we demonstrate the effectiveness of this research direction, motivating further improvements, tests, and applications.

Exploring the Transferability of Bias Mitigation Effects
We consider biases against protected groups in classifiers fine-tuned from LMs. In our present analysis, bias is defined as disparate model performance on different subsets of data which are associated with different demographic groups (e.g., instances that mention or are generated by different social groups) (Blodgett et al., 2020). Our evaluation of bias aligns with the definitions of equalized odds and equal opportunity (Hardt et al., 2016) in previous work on fairness in machine learning.
Here, we first outline our experimental setup for exploring the transferability of bias mitigation effects, in which we detail the process of applying UBM and pose three key research questions (section 2.1). We follow by introducing the bias factors studied and the corresponding classification tasks and datasets (section 2.2), and our evaluation protocols and metrics (section 2.3).

Experiment Setups of UBM
Our goal is to evaluate the transferability of bias mitigation effects for one or multiple bias factors in downstream fine-tuned models. We follow an Upstream Bias Mitigation (UBM) for Downstream Fine-Tuning procedure, pictured in Figure 2.

Figure 2: Experiment setups to study Upstream Bias Mitigation (UBM) for Downstream Fine-Tuning. We consider settings with the same or different upstream and downstream domains and tasks, while addressing one or more bias factors (e.g., both dialect bias and gender bias). The framework consists of two stages: (1) an upstream (source) model f_s = h_s ∘ g_s is trained with bias mitigation algorithms and (2) the encoder g_s is transferred to the downstream (target) model f_t for fine-tuning.

First, in the Upstream Bias Mitigation phase, an upstream (source) model f_s = h_s ∘ g_s, composed of a text encoder g_s and a classifier head h_s, is trained on one or more upstream datasets D_s with bias mitigation algorithms. The encoder g_s is to be transferred to downstream (target) domains and tasks while the classifier head h_s is discarded. Then, in the Downstream Fine-Tuning phase, the downstream model f_t = h_t ∘ g_t uses g_s to initialize its encoder weights and is fine-tuned for prediction performance, without bias mitigation approaches, on downstream datasets D_t. This UBM process is applied in three settings, summarized below, each of which contributes to evaluating the transferability of bias mitigation effects.

1. Fine-Tuning on the Same Distribution. In the simplest setting, we fine-tune the downstream model over new examples from the same data distribution as the upstream model. In practice, each dataset is split into two halves, with one used for upstream bias mitigation and the other for downstream fine-tuning.

2. Cross-Domain and Cross-Task Fine-Tuning. Similar to how LMs are fine-tuned for various tasks and domains, in a more practical setup, we test whether the transfer of bias mitigation effects is viable across domains and tasks. To achieve this, we apply bias mitigation while fine-tuning a LM on one dataset and perform fine-tuning on another.

3. Multiple Bias Factors. In the most challenging setup, we train a single upstream model to address multiple bias factors (e.g., both dialect bias and gender bias). Such upstream models can be trained with multi-task learning (i.e., jointly training over multiple datasets with a shared encoder g but different classifier heads h) while mitigating multiple kinds of bias. Subsequently, the resulting upstream model is transferred to downstream models as before. This is a key test of UBM's viability for widespread application.
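The two-stage procedure can be sketched as follows. This is a minimal, framework-free illustration in which dicts of numpy arrays stand in for real transformer weights; all function and variable names here are illustrative, not taken from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_model(in_dim, hid_dim, out_dim):
    # f = h ∘ g: "encoder" stands in for g, "head" for h
    return {"encoder": rng.normal(size=(in_dim, hid_dim)),
            "head": rng.normal(size=(hid_dim, out_dim))}

def ubm_transfer(upstream, out_dim):
    # Stage 2 initialization: keep the (bias-mitigated) encoder g_s,
    # discard the upstream head h_s, and draw a fresh head h_t.
    hid_dim = upstream["encoder"].shape[1]
    return {"encoder": upstream["encoder"].copy(),
            "head": rng.normal(size=(hid_dim, out_dim))}

# Stage 1 would train `upstream` with a bias mitigation objective;
# stage 2 fine-tunes `downstream` on the target task with no mitigation.
upstream = init_model(in_dim=8, hid_dim=4, out_dim=2)   # e.g., hate / non-hate
downstream = ubm_transfer(upstream, out_dim=3)          # new downstream labels
```

The key design point is that only the encoder carries over; the classifier head is always re-initialized for the downstream label space.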

Bias Factors and Datasets
To ensure our analysis holds for a variety of domains, tasks, and bias factors, we experiment with three different bias factors studied in previous research along with six different datasets (summarized in Table 1), described below.

Group Identifier Bias. This bias refers to higher false positive rates of hate speech predictions for sentences containing specific group identifiers, which harms protected groups by misclassifying innocuous text (e.g., "I am a Muslim") as hate speech. We include two datasets for study, namely the Gab Hate Corpus (GHC; Kennedy et al., 2018) and the Stormfront corpus (de Gibert et al., 2018). Both datasets contain binary labels for hate and non-hate instances, though with differences in their labeling schemas and domains.

AAVE Dialect Bias. Sap et al. (2019) show that offensive and hate speech classifiers yield a higher false positive rate on text written in African American Vernacular English (AAVE). This bias brings significant harm to the communities that use AAVE, for example, by leading to the disproportionate removal of text written in AAVE on social media platforms (Blodgett et al., 2020). We include two datasets for study: FDCL (Founta et al., 2018) and DWMW (Davidson et al., 2017). In both datasets, we treat abusive, hateful, and spam together as harmful outcomes (i.e., false positives for each are harmful) to compute false positive rates. Following Sap et al. (2019), we use an off-the-shelf AAVE dialect predictor (Blodgett et al., 2016) to identify examples written in AAVE.

Gender Stereotypical Bias. Zhao et al. (2018) summarize a list of occupations that are prone to be stereotyped in practice, leading to coreference resolution models and occupation prediction models (trained on short bios) exhibiting performance gaps between pro- and anti-stereotypical instances.
We train the coreference resolution model on the OntoNotes 5.0 dataset (Weischedel et al., 2013) and the occupation classifier on the BiasBios (De-Arteaga et al., 2019) dataset.

Evaluation Protocol and Metrics
We evaluate the overall performance of the models on downstream tasks along with appropriate bias metrics for each bias factor, as analyzed for each dataset and task in previous works. We expect UBM to minimally affect classification performance while improving bias metrics.

Classification Performance. We report in-domain F1 scores for GHC, Stormfront, and OntoNotes 5.0, and accuracy scores for FDCL, DWMW, and BiasBios. Following Zhang et al. (2018), for hate speech detection and toxicity detection datasets, we use the equal error rate (EER) threshold for prediction.

Group Identifier Bias Metrics. To evaluate group identifier bias, we measure false positive rate (FPR) differences, denoted FPRD, between examples mentioning one of 25 group identifiers provided by Kennedy et al. (2020) and the overall FPR. In addition, we follow Kennedy et al. (2020) in using a New York Times (NYT) articles corpus of 25k non-hate sentences, each mentioning one of the 25 group identifiers. This corpus specifically provides an opportunity to measure FPR, reported as NYT Acc., equivalent to 1−FPR. Additionally, following the evaluation protocol of Dixon et al. (2018), we report FPRD on the IPTTS dataset.

AAVE Dialect Bias Metrics. We use the BROD dataset, a large unlabeled collection of Twitter posts written in AAVE. Since in practice only a small portion of texts are toxic or spam, we treat all examples from BROD as normal, and report the accuracy (which equals 1−FPR) on the dataset.
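The FPRD metric can be computed as in the following sketch. The exact aggregation over groups (here, a sum of absolute FPR gaps over non-hate examples) is our assumption for illustration.

```python
import numpy as np

def fprd(y_true, y_pred, group_masks):
    """Sum over groups of |FPR(group) - FPR(overall)|, where FPR is the
    fraction of negative (non-hate) examples predicted positive.

    y_true, y_pred: 0/1 arrays; group_masks: boolean masks, one per
    group identifier, marking examples that mention that identifier."""
    neg = (y_true == 0)
    overall_fpr = y_pred[neg].mean()
    gaps = []
    for mask in group_masks:
        g = neg & mask
        if g.sum() == 0:          # skip identifiers absent from the data
            continue
        gaps.append(abs(y_pred[g].mean() - overall_fpr))
    return float(np.sum(gaps))
```

A model with uniform false positive rates across groups scores 0; larger values indicate that some identifier groups are flagged disproportionately often.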
Gender Stereotype Metrics. We employ the WinoBias (Zhao et al., 2018) dataset, which provides pro-stereotypical and anti-stereotypical coreference examples. We report the difference in F1 (F1-Diff) on the two subsets of data. For occupation prediction, following Ravfogel et al. (2020), we report the mean of true positive rate (TPR) differences between men and women in predicting each occupation.

Method
Here, we detail the particular bias mitigation algorithms used for implementing UBM, as well as the other baselines used for verifying the transferability of bias mitigation effects.

Implementations of UBM
We implement UBM with two different bias mitigation algorithms in the upstream bias mitigation phase: explanation regularization (Kennedy et al., 2020) and adversarial de-biasing (Zhang et al., 2018; Madras et al., 2018; Xia et al., 2020), denoted here as UBM reg and UBM adv , respectively.
UBM with Explanation Regularization. Explanation regularization reduces the importance placed on spurious surface patterns (i.e., words or phrases) during upstream model training. We apply UBM reg to group identifier and AAVE dialect bias, where the sets of spurious patterns are group identifiers and the most frequent words, from statistics of the dataset, used by AAVE speakers; we find explanation regularization not effective for gender bias. The importance of a surface pattern w ∈ W in the input x, noted as φ(w, x), is measured as the model prediction change when the pattern is removed. The model is trained by optimizing the main learning objective L_c while penalizing importance attributed to patterns w ∈ W that exist in the input x:

L = L_c + α Σ_{w ∈ W ∩ x} φ(w, x)^2,    (1)

where α is a trade-off hyperparameter.
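A toy instantiation of explanation regularization on a bag-of-words logistic model is sketched below. This is illustrative only: the squared-importance penalty and all names are our assumptions, and the actual method applies occlusion to a transformer encoder, not a linear model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(weights, counts):
    # toy model: p(hate | x) = sigmoid(w · x) over token counts
    return sigmoid(counts @ weights)

def occlusion_importance(weights, counts, token_idx):
    # φ(w, x): prediction change when the token is removed from the input
    occluded = counts.copy()
    occluded[token_idx] = 0.0
    return predict(weights, counts) - predict(weights, occluded)

def regularized_loss(weights, counts, label, spurious_idxs, alpha=0.1):
    # main cross-entropy objective plus a penalty on importance attributed
    # to spurious patterns that actually occur in the input
    p = predict(weights, counts)
    ce = -(label * np.log(p) + (1 - label) * np.log(1 - p))
    penalty = sum(occlusion_importance(weights, counts, i) ** 2
                  for i in spurious_idxs if counts[i] > 0)
    return ce + alpha * penalty
```

The penalty only fires when a spurious pattern is present, so training pushes the model toward predictions that do not hinge on those tokens.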
UBM with Adversarial De-biasing. In UBM adv , the upstream model is trained with adversarial de-biasing techniques, so that sensitive attributes related to bias (e.g., the dialect of the sentence or the gender referenced in the sentence) cannot be predicted from the hidden representations z given by the encoder g. During training, an adversarial classifier head h_adv is built upon the encoder and trained to predict sensitive attributes, while the encoder is optimized to prevent the adversarial classifier from succeeding. Formally, the optimization objective can be written as

min_{g, h} max_{h_adv} [ L_c(h(g(x)), y) − λ L_adv(h_adv(g(x)), a) ],    (2)

where a denotes the ground truth sensitive attribute, λ is a trade-off hyperparameter, and L_adv is the cross-entropy loss between the predicted sensitive attribute and the ground truth sensitive attribute.
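One training step of this minimax game can be sketched with a scalar toy model. The logistic-loss gradients below are hand-derived for this one-dimensional case; the real method trains a transformer encoder, typically via gradient reversal, so everything here is an illustrative assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adversarial_step(g, h, h_adv, x, y, a, lam=1.0, lr=0.1):
    """One update on scalar parameters: encoder g, task head h,
    adversary head h_adv. y is the task label, a the sensitive attribute."""
    z = g * x                     # encoder representation
    p_y = sigmoid(h * z)          # task prediction
    p_a = sigmoid(h_adv * z)      # adversary's prediction of the attribute
    # standard logistic-loss derivatives for this toy model
    d_task_h = (p_y - y) * z
    d_task_g = (p_y - y) * h * x
    d_adv_hadv = (p_a - a) * z
    d_adv_g = (p_a - a) * h_adv * x
    h_new = h - lr * d_task_h                     # head: minimize task loss
    h_adv_new = h_adv - lr * d_adv_hadv           # adversary: minimize L_adv
    g_new = g - lr * (d_task_g - lam * d_adv_g)   # encoder: REVERSED adv grad
    return g_new, h_new, h_adv_new
```

The sign flip on `d_adv_g` is the whole trick: the encoder ascends the adversary's loss while the adversary descends it, so at equilibrium the representation carries little attribute information.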
As mentioned in Sec. 2.1, upstream models can be trained to mitigate multiple bias factors with multi-task learning on multiple datasets. We separately apply bias mitigation algorithms for each dataset (sharing the same encoder) and note the algorithms applied in the subscript (e.g., UBM reg+adv ).

Other Baselines
We compare UBM with two families of methods.

Methods without Bias Mitigation. We evaluate two types of models that do not address bias. First, the Vanilla model is a downstream classifier directly fine-tuned on the downstream task from a LM (e.g., RoBERTa). Second, Van-Transfer is fine-tuned on upstream datasets without bias mitigation and then fine-tuned on downstream datasets.

Downstream Bias Mitigation. For reference, we show the results of directly applying explanation regularization, noted as Expl. Reg., or adversarial de-biasing, noted as Adv. Learning, during downstream fine-tuning. In most cases, mitigating bias in the downstream classifier should be the most effective way to reduce bias, though this is not always feasible in practice for the reasons discussed above.
We also consider two simple baselines that could reduce bias in downstream models via heuristics. Emb. Zero zeros out the word embeddings of spurious surface patterns (using the same word list as explanation regularization) in PTLMs before fine-tuning. We also include Emb. Zero Trans, which zeros out embeddings of spurious surface patterns before fine-tuning from an upstream model. These methods do not apply to cases where surface patterns related to bias (e.g., gendered pronouns) are crucial for prediction, e.g., coreference resolution.

Table 2: Same-domain and task UBM with a single bias factor. The source datasets are noted before the arrow (→). All metrics except in-domain F1 measure bias. See Table 6 in Appendix for complete results.
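The Emb. Zero heuristic amounts to the following sketch, with a toy vocabulary; real models zero rows of the PTLM's subword embedding matrix, and the names here are illustrative.

```python
import numpy as np

def emb_zero(embedding_matrix, vocab, spurious_words):
    """Zero out embedding rows for spurious surface patterns
    (e.g., group identifiers) before fine-tuning."""
    E = embedding_matrix.copy()
    for w in spurious_words:
        if w in vocab:            # ignore patterns missing from the vocab
            E[vocab[w]] = 0.0
    return E

vocab = {"muslim": 0, "weather": 1}
E = np.ones((2, 3))
E_zeroed = emb_zero(E, vocab, ["muslim", "unlisted_word"])
```

Because the zeroed rows can drift back during fine-tuning, the Emb. Zero Trans variant applies the same operation to the upstream model's embeddings instead.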

Results
In this section, we present the results of UBM in three settings, following the order in Sec. 2.1: transferring to the same data distribution, transferring to different data distributions, and transferring from an upstream model with bias mitigation for multiple bias factors. We follow these main analyses with an investigation of the impact of freezing encoder weights before downstream fine-tuning, and lastly with a brief exploration of how UBM's positive results are achieved.

Implementation Details. In all experiments reported below, models are initially fine-tuned from RoBERTa-base. The upstream model is trained for a fixed number of epochs and the checkpoint with the best prediction performance is transferred to the downstream model. See the Appendix for more implementation details. We use D_s → D_t as the transfer notation, in which the upstream and downstream datasets are on the left- and right-hand sides of the arrow, respectively.

UBM with the Same Data Distribution
We first briefly show the results when the downstream model sees new, unseen samples from the same data distribution as the upstream model. In this controlled setting, we isolate and test the basic viability of UBM, which requires that information from the upstream model be retained during downstream fine-tuning. GHC, Stormfront, FDCL, and BiasBios were each partitioned into two equally sized subsets, noted as subsets A and B of the corresponding datasets, to train the upstream and downstream models respectively. Table 2 presents the results for mitigating group identifier bias in the GHC. Comparing with Vanilla training and Van-Transfer, we see an overall bias reduction via UBM. We include full results and discussion for this simple setting in the Appendix.

Cross-domain and Task UBM
Following the result that UBM is effective in the same-domain setting, we now move to analyzing cross-domain settings in greater depth. For hate speech classification, we perform transfer learning from GHC to Stormfront and from Stormfront to GHC; and for toxicity classification, we perform transfer learning from FDCL to DWMW. We also perform transfer learning from BiasBios (occupation prediction) to OntoNotes 5.0 (coreference resolution). Table 3 shows the results of cross-domain and task transfer learning and non-transfer baselines. Our findings are summarized below.
UBM can reduce bias in different target domains and tasks compared to fine-tuning without bias mitigation. The results of cross-domain and cross-task transfer learning (i.e., Stf.→GHC, GHC→Stf., FDCL→DWMW) show that transferring from a less biased upstream model (UBM Reg and UBM Adv ) leads to better downstream bias mitigation than directly training without bias mitigation in the target domain (Vanilla). Meanwhile, in-domain classification performance has improved (on GHC and Stormfront) or been preserved (on DWMW). Notably, directly mitigating bias (Expl. Reg., Adv. Learning) on DWMW is not effective, as previously observed by Xia et al. (2020), while transferring from FDCL is successful.
There are exceptions where UBM fails to reduce bias. We see the in-domain FPRD on Stormfront does not improve; however, as discussed in our metrics section, the in-domain FPRD is computed over a much smaller set of examples compared to NYT and IPTTS datasets, and is thus less reliable. UBM does not reduce bias compared to Vanilla training on OntoNotes 5.0, but achieves less bias compared to Van-Transfer. This result confirms the effect of bias mitigation in upstream models, but the transfer learning itself has increased the bias.
Comparison with Emb. Zero and Emb. Zero Trans. We find the two alternative methods, Emb. Zero and Emb. Zero Trans, also reduce bias on some of the datasets. On GHC, Emb. Zero achieves in-domain FPRD and IPTTS-FPRD lower than UBM. However, this comes with a clear drop in in-domain classification performance.

Mitigating Multiple Bias Factors
Having observed an overall positive effect of UBM across domains and tasks, next we present the results of experiments on mitigating multiple bias factors with a single upstream model. This involves training an upstream model with multiple bias mitigation objectives across multiple datasets, followed by fine-tuning on a single dataset without bias mitigation. We test three combinations of datasets. First, a multi-task model is trained to jointly mitigate group identifier bias and AAVE dialect bias using GHC and FDCL (GHC + FDCL), and transferred to Stormfront and DWMW. Next, a model is similarly trained jointly on group identifier and AAVE biases on Stormfront and FDCL (Stf. + FDCL) and transferred to GHC and DWMW. Lastly, models were trained over three source datasets (GHC + FDCL + BiasBios) to additionally address gender stereotypical bias. Results are reported in Table 4.

Comparison to Single-Dataset Vanilla Baselines. As a basic measure of bias mitigation success, we compare multi-dataset models' results with single-dataset Vanilla training and Van-Transfer. We see UBM with GHC + FDCL successfully reduces both group identifier bias and AAVE dialect bias in downstream models. UBM with GHC + FDCL + BiasBios also successfully reduces group identifier bias in terms of IPTTS FPRD (the most reliable bias metric given its large size) as well as AAVE bias. It also reduces gender stereotypical bias compared to Van-Transfer in some experimental runs, but in an unstable manner, demonstrated by the large variance of F1-Diff and degenerate runs of UBM Reg,Reg,Adv .
Results of UBM on Stf. + FDCL are less promising. We find UBM Reg,Adv,Adv is not successful in reducing group identifier bias. UBM Reg,Reg,Adv could reduce bias on IPTTS-FPRD, but does not improve other metrics. Notably, UBM on Stf. + FDCL clearly underperforms UBM on Stf. only.
UBM Reg versus UBM Adv . Empirically, we find using explanation regularization on FDCL (UBM reg,reg , UBM reg,reg,adv ) instead of adversarial learning (UBM reg,adv , UBM reg,adv,adv ) consistently improves bias mitigation performance on other bias factors.
Takeaways. Our results show it is possible to reduce multiple bias factors via UBM. However, we have shown that these effects are not automatic for each new dataset added to upstream models for multi-task bias mitigation.

Freezing or Regularizing Model Weights
In the experiments above, we have shown that the effect of mitigating bias is partially preserved under simple fine-tuning. Next, we study whether freezing the encoder or discouraging its weight changes improves bias mitigation in the target domain, as these approaches intuitively try to retain the effect of bias mitigation. However, we find a counterintuitive result: these approaches typically do not achieve reduced downstream bias, and in fact reduce in-domain classification performance. Table 5 shows the results when we keep the encoder weights frozen (Freeze) or discourage weights from changing with a regularizer toward the upstream weights. We conjecture that restricting encoder updates forces the model to capture simple but spurious correlations.

Investigating Why UBM Reduces Bias
We attempt to interpret why a model fine-tuned from a de-biased upstream model remains less biased, from the perspective of the gradient of the importance attributed to words w related to bias factors (e.g., group identifiers) by the input occlusion algorithm; a large importance attribution usually induces bias. Figure 3 plots the importance attribution of group identifiers φ(w, x) and the norm of its gradient w.r.t. the parameters θ of the encoder g, denoted ||∇_θ φ(w, x)||_2. UBM reduces the gradient of φ(w, x), so that φ(w, x) is less likely to change at the beginning of downstream fine-tuning. Fig. 3 shows UBM has not only reduced the value of importance attributed to spurious patterns, but also their gradients. The gradient norm is highly indicative of how the importance φ(w, x) will change in the downstream model, because when the loss in Eq. 1 in the upstream model is minimized, the gradient of the penalty term w.r.t. θ has the same norm but the opposite direction as the gradient of the main classification objective ∇_θ L_c. This implies that whether the upstream model converges at an optimum where both objectives agree (i.e., gradients are small) can be an important indicator of the success of UBM.
The figure further shows that both the gradient and the value of φ(w, x) remain small for UBM reg over the whole training process. We leave further study of the training dynamics of UBM to future work.
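On a toy bag-of-words model, ||∇_θ φ(w, x)||_2 can be estimated by finite differences, as sketched below. In the paper θ are transformer encoder parameters and the gradient comes from backpropagation; this numeric version is an illustrative assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def phi(theta, counts, idx):
    # occlusion importance: prediction change when token idx is removed
    occluded = counts.copy()
    occluded[idx] = 0.0
    return sigmoid(counts @ theta) - sigmoid(occluded @ theta)

def importance_grad_norm(theta, counts, idx, eps=1e-5):
    # central-difference estimate of ||∇_θ φ(w, x)||_2
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[j] += eps
        tm[j] -= eps
        grad[j] = (phi(tp, counts, idx) - phi(tm, counts, idx)) / (2 * eps)
    return float(np.linalg.norm(grad))
```

A small norm means downstream gradient steps barely move the importance of the spurious token, matching the intuition that UBM's effect survives fine-tuning.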

Related Works
Here we review approaches that inform the present work (techniques for bias mitigation) and are related to the basic idea of UBM.
Mitigating bias in representations. Bias can be mitigated directly in representations of data. Zhang et al. (2018); Beutel et al. (2017) proposed training a classifier together with an adversarial predictor for sensitive attributes. Madras et al. (2018) further studied re-usable de-biased representations by training a new downstream classifier (potentially with a different classification task) using the learned representations. However, this practice relies on frozen representations (rather than models themselves), which precludes the possibility of generating predictions for new data.
Mitigating bias in pretrained models. Another line of work addresses bias in pretrained models such as word vectors and BERT (Zhou et al., 2019; May et al., 2019; Bhardwaj et al., 2020; Liang et al., 2020). Many such studies again focus on bias in frozen data representations, and do not study their effects on downstream classifiers. Others alternatively assess the propagation of bias from pretrained models to downstream classifiers: Ravfogel et al. (2020) study algorithms for mitigating bias in pretrained models by de-biasing the learned representations, which can subsequently be used in classifiers as frozen representations.
Transfer learning of fairness and robustness. A few previous works have studied related research problems, with significant differences from our work. Though Schumann et al. (2019) theoretically analyze the transferability of fairness across domains, they assume simultaneous access to source and target domain data, which does not account for transferring upstream bias mitigation to arbitrary downstream fine-tuned models. Shafahi et al. (2020) study transfer learning of robustness to adversarial attacks under fine-tuning, but do not seek to mitigate bias.

Conclusion
We observe that the effects of bias mitigation are indeed transferable in fine-tuning LMs. Future works in fine-tuning LMs can use UBM in order to easily apply the positive effects of bias mitigation methods to new domains and tasks without customized bias mitigation processes or access to sensitive user information. Though UBM does not rival directly mitigating bias on the downstream task, it is more efficient and accessible. Future works can develop the effectiveness of UBM beyond the default scenarios in this paper, and potentially apply it to tasks and settings beyond hate speech, toxicity classification, occupation prediction, and coreference resolution in English corpora.

Broader Impact Statement
Our analysis demonstrates the effectiveness of Upstream Bias Mitigation for Downstream Fine-Tuning. As stated in the paper, the reduced effort of downstream bias mitigation will facilitate broader application of bias mitigation in the growing deep learning community.
While we may expect to obtain an "off-the-shelf" language model that reduces multiple kinds of bias with UBM, we emphasize that proper evaluation of bias may still be required on the downstream side, especially when bias mitigation must be guaranteed. Currently, our initial analysis of UBM confirms that bias mitigation effects are transferable, but does not provide guarantees of bias mitigation or of the level of bias mitigation in a given setting. The findings in this analysis should identify the potential of UBM to the broader NLP and machine learning communities, and may be extended with new approaches within the UBM framework, or interpretation techniques (as in Sec. 4.5).

A.1 Implementation Details

We use RoBERTa-base as our base model. In the bias mitigation phase, models for GHC, Stormfront, FDCL, DWMW, and BiasBios are trained with a learning rate of 1e-5, and the checkpoint with the best validation F1 or accuracy score is provided to the fine-tuning phase. We train on GHC, FDCL, DWMW, and BiasBios for a maximum of 5 epochs and on Stormfront for a maximum of 10 epochs. The checkpoint with the best validation in-domain classification performance is kept. In the fine-tuning phase, we try learning rates of 1e-5 and 5e-6, and report the results with the higher validation in-domain classification performance. For the coreference resolution model on OntoNotes 5.0, we adapt an existing code implementation (Joshi et al., 2019) to support loading RoBERTa-base as the base model. We use the same hyperparameter settings as BERT-base in the provided code implementation.
Mean and standard deviation of performance are computed over 3 runs for most experiments, with the same set of random seeds; for the GHC and Stf. experiments in Table 3, and for UBM reg,reg,adv on OntoNotes 5.0, we run 6 runs. Models, except the coreference resolution models on OntoNotes, are trained on a single GTX 2080 Ti GPU. Coreference resolution models are trained on a single Quadro RTX 6000 GPU.
The training time per iteration is consistent across experiments, at about 1.5 iterations per second, except for coreference resolution. Training the coreference resolution model on OntoNotes 5.0 takes around 8 hours. The largest of the other datasets, BiasBios, takes 2 hours to train.

A.2 Details of Bias Mitigation Algorithms
For the explanation regularization algorithm, we set the regularization strength α to 0.03 for GHC and Stormfront experiments, and 0.1 for FDCL and DWMW experiments. We regularize importance scores on the 25 group identifiers in Kennedy et al. (2018) for GHC and Stormfront. These group identifiers are the ones that have the largest coefficients in a bag-of-words linear classifier. For FDCL, we extract the 50 words with the largest coefficients in the bag-of-words linear classifier among words with an AAVE dialect probability higher than 60% (given by the off-the-shelf AAVE dialect predictor of Blodgett et al., 2016).

Table 7: Dealing with multiple bias factors with a single upstream model with UBM, where the domains and tasks are the same in the upstream and the downstream model. Arrows show whether each metric has increased or decreased (both imply improvement) compared to non-transfer Vanilla training in Table 3.
The ℓ2-sp regularizer is written as Ω(w) = β||w − w_0||_2^2, appended to the learning objective, where w_0 are the transferred upstream weights and β is a hyperparameter controlling the strength of the regularization. We report results where β = 1. We tried different values of β from 1e-6 to 100, increasing β by a factor of 10 each time, but do not see changes in the conclusion.

Table 6 shows the results of same-domain transfer with a single bias factor. Table 7 further shows the results of addressing multiple bias factors in this setup. On GHC, Stormfront, and BiasBios, UBM overall reduces bias compared to Vanilla and Van-Transfer. We notice the NYT accuracy on Stormfront in the Stf. A → Stf. B setup is an exception. However, the bias is not reduced on Stf. B even when we directly run explanation regularization in the target domain. We reason that the half-Stormfront dataset is small and the average sentence length differs considerably between Stormfront and NYT, so a model trained on Stormfront hardly generalizes to NYT.
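The ℓ2-sp penalty itself is a one-liner; the sketch below treats the weights as a flat vector for illustration (in practice it is summed over all encoder parameters, with w_0 the transferred upstream weights).

```python
import numpy as np

def l2_sp_penalty(w, w0, beta=1.0):
    # Ω(w) = β ||w − w0||_2^2: pulls fine-tuned weights toward the
    # upstream starting point w0 rather than toward zero
    diff = np.asarray(w, dtype=float) - np.asarray(w0, dtype=float)
    return beta * float(diff @ diff)
```

Unlike plain weight decay, the anchor is the upstream solution, which is why this regularizer was a natural candidate for trying to retain bias mitigation effects.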

B Complete Analysis of UBM over the Same Data Distribution
We find intriguing results on FDCL. From FDCL A → FDCL B in Table 6, we find bias is not reduced with UBM. However, as shown in Table 7, when the upstream model is trained jointly with other datasets to reduce multiple bias factors (Stf. A + FDCL A, GHC A + FDCL A, GHC A + FDCL A + BiasBios A), the bias is clearly reduced.

C Applying UBM with Downstream Bias Mitigation
In Table 8, we report the performance of applying both upstream and downstream bias mitigation, compared with downstream bias mitigation only, and with downstream bias mitigation over a vanilla-transferred model. We see UBM further reduces bias in the Stormfront → GHC setup, while failing to improve in GHC → Stormfront. Compared to our previous results in Tables 3 and 4, we see a clearer transfer of bias mitigation effects when downstream bias mitigation is also applied.