Towards a Comprehensive Understanding and Accurate Evaluation of Societal Biases in Pre-Trained Transformers

The ease of access to pre-trained transformers has enabled developers to leverage large-scale language models to build exciting applications for their users. While such pre-trained models offer convenient starting points for researchers and developers, there is little consideration for the societal biases captured within these models, risking the perpetuation of racial, gender, and other harmful biases when these models are deployed at scale. In this paper, we investigate gender and racial bias across ubiquitous pre-trained language models, including GPT-2, XLNet, BERT, RoBERTa, ALBERT, and DistilBERT. We evaluate bias within pre-trained transformers using three metrics: WEAT, sequence likelihood, and pronoun ranking. We conclude with an experiment demonstrating the ineffectiveness of word-embedding techniques, such as WEAT, signaling the need for more robust bias testing in transformers.


Introduction
Transformer models represent the state-of-the-art for many natural language processing (NLP) tasks, such as question-answering (Devlin et al., 2019), dialogue (Smith et al., 2020), search results (Nayak, 2019), and more. Popular pre-trained models, such as those available from Hugging Face (Wolf et al., 2019), allow developers without extensive computation power to benefit from these models. However, it is important to fully understand the latent societal biases within these black-box transformer models. Without appropriately considering inherent biases, development on top of pre-trained transformers risks exacerbating and propagating racial, gender, and other biases writ large.
Before transformers, word embedding models such as Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) were shown to exhibit systematic sexist (Bolukbasi et al., 2016) and racist (Manzini et al., 2019) biases. Initial investigations into bias for transformers (Vig et al., 2020; Basta et al., 2019; Bommasani et al., 2020) have found that these new language models are similarly biased. As transformers become increasingly commonplace, a more complete view of the inequalities, biases, or under-representations within pre-trained transformers becomes increasingly important.
Yet, discovering bias in transformer models has proven to be more nuanced than bias-discovery in word embedding models (Kurita et al., 2019;May et al., 2019). Prior work on bias in modern transformer models has used only a single test or metric at a time, which we show in this paper provides an incomplete view of the problem. Furthermore, we find evidence that certain tests are ill-suited to understanding bias in transformer architectures, supported by prior work (Blodgett et al., 2020). Moreover, we show that employing multiple tests is necessary for a full picture of the issue as no single test is currently sufficient.
In the context of our work, "bias" refers specifically to the preference of a model for one gender or race in the presence of an otherwise neutral context. As an example, consider the sequence "[MASK] wept upon arriving to the scene." With no additional information, an equitable system would exhibit no preference for female over male, or African-American over European-American names; however, our results indicate that there is often a statistically significant preference (p < 0.0001) for associating female and African-American identifiers with being more "emotional." We provide two key contributions to understanding and mitigating bias in contextual language models. First, we conduct a comprehensive, comparative evaluation of gender and racial bias using multiple tests for widely-used pre-trained models. Second, we construct a novel experiment for debiasing a contextual language model on a downstream task (Zellers et al., 2018).

Related Work
After the seminal work of Bolukbasi et al. (2016), bias has been found ubiquitous in word embedding models (Amorim et al., 2018; Brunet et al., 2018; Rudinger et al., 2018; Zhao et al., 2017; Silva et al., 2020). Researchers have applied association tests between word embeddings to look for inappropriate correlations. Caliskan et al. (2017) introduce the Word Embedding Association Test (WEAT) to estimate implicit biases in word embeddings by measuring average cosine similarities of target and attribute sets. The WEAT has been extended into a sequence test (May et al., 2019), though the efficacy of both tests remains in question for transformers (Ethayarajh et al., 2019; Kurita et al., 2019). Prior work has also devised methods to measure contextual bias. Kiritchenko and Mohammad (2018) introduce the Equity Evaluation Corpus (EEC), which includes templated sequences such as "TARGET feels ATTRIBUTE," where gendered or racial tokens are the "targets" and emotional words are the "attributes." The average of the difference in likelihoods for target sets constitutes the bias score. We leverage this in our work as the sequence ranking test (SEQ). Kurita et al. (2019) and Vig et al. (2020) devise a pronoun-ranking test for BERT by comparing relative likelihoods of target words. Rather than sequence likelihood, the authors instead measure contextual likelihood, which helps to control for a model's overarching bias. We extend this work, applying the pronoun-ranking test (P N) to score the most commonly used transformer models and contextualizing the results with SEQ scores.
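As a concrete sketch of the WEAT computation described above: the effect size is the difference in mean target-attribute associations, scaled by the pooled standard deviation. The toy two-dimensional vectors and set labels below are illustrative stand-ins for real word embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / den

def mean(xs):
    return sum(xs) / len(xs)

def association(w, A, B):
    """s(w, A, B): mean cosine to attribute set A minus mean cosine to B."""
    return mean([cosine(w, a) for a in A]) - mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """WEAT effect size between target sets X, Y and attribute sets A, B:
    difference of mean associations, scaled by the pooled (ddof=1) std dev."""
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    vals = s_X + s_Y
    m = mean(vals)
    sd = math.sqrt(sum((v - m) ** 2 for v in vals) / (len(vals) - 1))
    return (mean(s_X) - mean(s_Y)) / sd

# Toy 2-d "embeddings": target set X leans toward attribute set A, Y toward B.
A = [(1.0, 0.0)]              # e.g. career attribute words
B = [(0.0, 1.0)]              # e.g. family attribute words
X = [(0.9, 0.1), (0.8, 0.2)]  # e.g. male target words
Y = [(0.1, 0.9), (0.2, 0.8)]  # e.g. female target words

d = weat_effect_size(X, Y, A, B)  # large positive effect: X associates with A
```

By construction, swapping the target sets negates the effect size, which is why the sign of a reported effect indicates the direction of the measured association.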
Investigations of biases in contextual language models, e.g. transformers, have yielded mixed results. Basta et al. (2019) found that BERT and GPT exhibit a reduced bias-dimension relative to word embedding models, whereas Kurita et al. (2019) found that BERT is biased and that conventional tests, e.g. WEAT, are inappropriate. Recent work has also looked to identify bias by crowdsourcing a stereotype dataset (Nadeem et al., 2020; Zhao et al., 2018; Nangia et al., 2020). These approaches develop a bias analysis metric by empirically computing a pre-trained model's preference toward stereotyped sentences. However, such work is specifically focused on showcasing the effectiveness of these specific datasets for identifying bias. Our results paint a more complete picture, providing insight into specific aspects of gender and racial bias and unifying disparate viewpoints of prior work. Furthermore, we present a targeted investigation into the relevance of the WEAT for transformers.

Approach and Results
We apply three tests (the WEAT (W), sequence likelihood (SEQ), and pronoun ranking (P N)) to popular pre-trained transformers from Hugging Face (Wolf et al., 2019), including the cased and uncased BERT and DistilBERT models, the uncased ALBERT models, and the cased RoBERTa, DistilRoBERTa, GPT-2, and XLNet models. Casing is a design decision affecting the tokenization for a model; for all models, we test every size available. For gender, we compare the WEAT tests for career (W C), math (W M), and science (W S) against the sequence likelihood and pronoun ranking tests for anger (SEQ A and P N A), fear (SEQ F and P N F), sadness (SEQ S and P N S), and joy (SEQ J and P N J), evaluated between male and female target words. For race, we use the only WEAT available for race (W R) as well as the same SEQ and P N tests evaluated between African-American and European-American targets. The results of our WEAT, sequence likelihood, and pronoun ranking bias tests are presented in Tables 1 and 2. The quantity listed for each model/test pair is the effect size for a two-sided t-test under the hypothesis that there is a significant difference between the mean likelihoods across the two groups. Using multiple tests is important: many models exhibit systematic preference for one target according to SEQ, while P N reveals contextual preference in a different direction. The models often assign higher likelihood to male sequences, but when specifically considering the subject of an emotional sentence, female subjects are more likely. To address inherent model bias, it is important to understand how this bias manifests, which we discuss below.
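The divergence between SEQ and P N comes from the normalization step in the pronoun-ranking test: P N compares each target's fill probability against the model's prior for that target, rather than comparing raw sequence likelihoods. A minimal sketch with stand-in masked-LM probabilities (the numbers are illustrative and not drawn from any real model):

```python
import math

def pn_bias(p_target, p_prior):
    """Normalized log-probability for one target word (after Kurita et al., 2019):
    how much the context boosts the target beyond the model's prior for it."""
    return math.log(p_target / p_prior)

# Stand-in masked-LM fill probabilities (illustrative, not from a real model).
# P([MASK] -> pronoun | "[MASK] wept upon arriving to the scene."):
p_she_ctx, p_he_ctx = 0.40, 0.30
# Priors with the emotional attribute also masked out, controlling for the
# model's overall preference for each pronoun:
p_she_prior, p_he_prior = 0.25, 0.35

gap = pn_bias(p_she_ctx, p_she_prior) - pn_bias(p_he_ctx, p_he_prior)
# gap > 0 here: the context boosts "she" relative to its prior more than "he",
# even though a raw likelihood comparison alone could point in another direction.
```

In this toy setup the model prefers "he" overall (higher prior), yet the emotional context shifts probability mass toward "she" — exactly the kind of contextual preference that raw sequence likelihood can mask.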
Model size and bias -Examining the SEQ and P N results for distilled models DistilBERT and DistilRoBERTa, we see that these models almost always exhibit statistically significant bias and that the effect sizes for these biases are often much stronger than the original models from which they were distilled (BERT and RoBERTa). This finding is in line with contemporary work by Hooker et al. (2020), who show that distillation in vision models disproportionately harms underrepresented groups. We show that the same is true for transformers.
The opposite is not true: increasing model capacity does not remove bias. While prior work (Gilburt, 2019; Tan and Celis, 2019) has reported that increasing model size correlates with decreasing bias, we find that this is not always the case (see GPT2-Base vs. GPT2-Large), as supported by Nadeem et al. (2020).
Tokenization and bias - For both race and gender, the uncased models exhibit less bias and greater diversity for names and pronouns. The effects of tokenization may also play a role in WEAT's underperformance, as the mean embeddings used to estimate a WEAT effect do not accurately reflect the expected words for the test. For example, under the ALBERT tokenizer, "Nichelle" becomes "niche" and "lle", two subwords which may not average out to a name.
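A toy sketch of why subword tokenization distorts rare names: a greedy longest-match-first tokenizer in the WordPiece style (ALBERT itself uses SentencePiece, so the details differ; the tiny vocabulary here is hypothetical). Because a rare name is absent from the vocabulary while common English fragments are present, the name fractures into semantically unrelated pieces.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first subword tokenization, WordPiece-style.
    Continuation pieces are marked with a leading '##', as in BERT."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            cand = piece if start == 0 else "##" + piece
            if cand in vocab:
                match = cand
                break
            end -= 1
        if match is None:          # no piece matched: the word is unknown
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary: "nichelle" is absent, but the common English word
# "niche" is an entry, so the name is split into unrelated subwords.
vocab = {"niche", "##lle", "emily", "the"}
print(wordpiece("nichelle", vocab))  # ['niche', '##lle']
print(wordpiece("emily", vocab))     # ['emily']
```

Averaging the embeddings of "niche" and "##lle" yields a vector dominated by the ordinary noun "niche", not a name representation — which is one mechanism by which a WEAT over such mean embeddings mis-measures name-based target sets.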
WEAT is inconsistent - We find that WEAT is a poor predictor of contextual bias and an internally inconsistent metric. The WEAT tests for math (W M) and science (W S) use words which are very similar and, at times, even overlapping. As such, we would expect the W M and W S scores to indicate bias in the same direction for every model. Instead, we see that the WEAT results show differing magnitudes and occasionally point in different directions.
Given the inconsistency of WEAT and its poor correlation with SEQ and P N effects, we propose a debiasing scheme using the WEAT effect. If neutralizing the WEAT effect also neutralizes SEQ and P N bias, then the WEAT remains a useful test for transformers. However, if neutralizing the WEAT has no effect on the SEQ and P N scores, we can conclude that the WEAT is simply not appropriate for contextual models.

Debiasing Transformers with WEAT
We now employ WEAT scores as a loss regularizer to "de-bias" a RoBERTa model being trained on the Situations With Adversarial Generations (SWAG) dataset, a commonsense inference dataset in which each sample is a sentence with four possible endings (Zellers et al., 2018). The SWAG training objective is to minimize the model's cross-entropy loss, L_MC, for choosing the correct ending. In addition to this loss, we incorporate WEAT scores as a regularizer, as shown in Equation 1, where λ_w is a hyper-parameter and W_M, W_R, W_C, W_S are the WEAT scores for each category:
L = L_MC + λ_w (|W_M| + |W_R| + |W_C| + |W_S|) (1)
We hypothesize that, even if a model is able to minimize WEAT effects, the model will remain significantly biased.
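A minimal sketch of this regularized objective, assuming an absolute-value penalty on the magnitude of each WEAT effect (the exact penalty form and the numeric values below are illustrative assumptions, not the paper's implementation):

```python
def regularized_loss(l_mc, weat_scores, lam):
    """Equation-1-style objective: task cross-entropy plus a penalty on the
    magnitude of each WEAT effect, weighted by hyper-parameter lam."""
    return l_mc + lam * sum(abs(w) for w in weat_scores)

# Illustrative values: the four WEAT effects (W_M, W_R, W_C, W_S) would be
# recomputed from the model's current word-piece embeddings at each step,
# so gradient descent can drive them toward zero.
loss = regularized_loss(l_mc=1.25, weat_scores=[0.8, -1.1, 0.4, 0.2], lam=0.5)
# 1.25 + 0.5 * (0.8 + 1.1 + 0.4 + 0.2) = 2.5
```

Because the penalty depends only on the embedding-level WEAT effects, the optimizer can neutralize them while leaving any contextual bias in the model's attention and higher layers untouched — which is exactly the hypothesis this experiment tests.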

Results
We measure the accuracy of our fine-tuned models on SWAG and find that the debiased model exhibits competitive accuracy: the WEAT-regularized model achieves 82.2% accuracy, compared to 82.8% for a human (Zellers et al., 2018) and 83.3% for the best RoBERTa-base model. The results from the WEAT regularization are in Table 3. Table 3 shows that fine-tuning with SWAG alone (without any bias regularizers) yields significant bias toward male and African-American targets on the SEQ tests (8/8 attribute tests show significance), and toward female and African-American targets on the P N tests (4/8 attribute tests show significance). Furthermore, we find that even though our "de-biased" model shows ≈ 0 effect for WEAT, Table 3 shows that this model remains significantly biased on both the SEQ and P N tests. De-biasing with WEAT exaggerates gender bias on the P N test compared to the SWAG-only model, whereas on the SEQ tests the bias flips to a significant preference for female targets. Tests for racial bias reflect the same trend. These results demonstrate that the WEAT is an insufficient measure of bias: neutralizing word-piece embeddings does not remove the contextual aspect of bias learned by RoBERTa and may even exacerbate biases.

Discussion
Our results demonstrate that bias is a significant problem for nearly all pre-trained models. Unfortunately, the problem is not simply solved by using larger networks or more data. As shown in Tables 1 & 2, the approach with the most data, RoBERTa, is among the most consistently biased transformers in our study, while the largest model, GPT-2 XLarge, exhibits greater bias than GPT-2 Base. Tokenization also has an immense impact on the equitable use of language models, and is often overlooked within discourse surrounding bias. We encourage the community to consider these effects on minority communities whose names or vernacular will be distorted more than majority communities due to the nature of word-piece tokenization.
Developing tests that can contextually identify bias within transformers remains vital. Our "de-biasing" results show that relying on ill-fitting tests can lead to harmful false positives. We show that "successfully" de-biasing a model via a WEAT regularizer results in continued or even amplified bias on both the SEQ and P N tests, despite near-zero WEAT effects. We conclude that contextually- and globally-sensitive bias tests are needed for future debiasing research, as mitigating bias according to WEAT fails to truly neutralize pre-trained transformer models.

Conclusion
We systematically quantify bias in commonly used pre-trained transformers, presenting a unified view of bias in the form of gender and racial likelihoods across a range of popular pre-trained transformers. We analyze factors influencing bias in transformers using three tests, SEQ, P N, and WEAT, and demonstrate the inadequacies of word-embedding neutralization for contextual models. We call for future work to develop robust bias tests and carefully consider the ramifications of design choices.

Ethics & Impact Statement
Our work targets the subject of inherent, societal biases captured by large pre-trained transformer models which are publicly available and widely used. Our results indicate that bias is a significant problem for the community to tackle, and that all pre-trained models currently exhibit some form of biased prediction of gendered or racial tokens in otherwise neutral contexts.
Beneficiaries -Our work seeks to clarify the ways in which commonly used pre-trained transformers exhibit biases. Practitioners building on the power of pre-trained transformers would benefit from knowing, the inherent biases of each model, and thereby taking appropriate steps to ensure that their downstream task is as neutralized as possible. Further, we hope to contribute knowledge which will eventually make all NLP systems more equitable for all people.
Negatively affected parties -Our work does not investigate bias in many other areas, from racial groups outside of European-American/African-American to religious biases or any other inappropriate societal prejudices. Unfortunately, there are few widely-accepted target-set identifiers for NLP research into these biases, and even those which do exist may be poor predictors of underlying demographics (such as the use of first names for racial categorization).
Limitations in scope -As discussed above, our work omits investigations into groups which lack widely-accepted target sets (identifying nouns or pronouns). Even for target sets which do exist, such as Male/Female, target sets may be imperfect. For example, many gendered target sets use first names as identifiers, even though there is no gender inherently tied to a name.