Predictive Biases in Natural Language Processing Models: A Conceptual Framework and Overview

An increasing number of natural language processing papers address the effect of bias on predictions, introducing mitigation techniques at different parts of the standard NLP pipeline (data and models). However, these works have been conducted in isolation, without a unifying framework to organize efforts within the field. This situation leads to repetitive approaches and a focus on bias symptoms/effects rather than on their origins, which could limit the development of effective countermeasures. In this paper, we propose a unifying predictive bias framework for NLP. We summarize the NLP literature and suggest general mathematical definitions of predictive bias. We differentiate two consequences of bias, outcome disparities and error disparities, as well as four potential origins of biases: label bias, selection bias, model overamplification, and semantic bias. Our framework serves as an overview of predictive bias in NLP, integrating existing work into a single structure, and providing a conceptual baseline for improved frameworks.


Introduction
Predictive models in NLP are sensitive to a variety of (often unintended) biases throughout the development process. As a result, fitted models do not generalize well, incurring performance and reliability losses on unseen data. They also have socially undesirable effects by systematically underserving or mispredicting certain user groups.
The general phenomenon of biased predictive models in NLP is not recent. The community has long worked on the domain adaptation problem (Jiang and Zhai, 2007; Daume III, 2007): models fit on newswire data do not perform well on social media and other text types. This problem arises from the tendency of statistical models to pick up on non-generalizable signals during the training process. In the case of domains, these non-generalizable signals are words, phrases, or senses that occur in one text type, but not another.
However, this kind of variation is not restricted to text domains: it is a fundamental property of human-generated language. We talk differently from our parents or from people in a different part of our country (Pennebaker and Stone, 2003; Eisenstein et al., 2010; Kern et al., 2016). In other words, language reflects the diverse demographics, backgrounds, and personalities of the people who use it. While these differences are often subtle, they are distinct and cumulative (Trudgill, 2000; Kern et al., 2016; Pennebaker, 2011). Similar to text domains, this variation can lead models to pick up on patterns that do not generalize to other author demographics, or to rely on undesirable word-demographic relationships.
Bias may be an inherent property of any NLP system (and, more broadly, of any statistical model), but this is not per se negative. In essence, biases are priors that inform our decisions: a dialogue system designed for elders might work differently from one designed for teenagers. Still, undetected and unaddressed, biases can lead to negative consequences: aggregate effects for demographic groups combine to produce predictive bias. That is, the label distribution of a predictive model reflects a human attribute in a way that diverges from a theoretically defined "ideal distribution". For example, a part-of-speech (POS) tagger reflecting how an older generation uses words (Hovy and Søgaard, 2015) diverges from the population as a whole.
A variety of papers have begun to address countermeasures for predictive biases (Li et al., 2018; Elazar and Goldberg, 2018; Coavoux et al., 2018).[1] Each identifies a specific bias and countermeasure on its own terms, but it is often not explicitly clear which bias is addressed, where it originates, or how it generalizes. There are multiple sources from which bias can arise within the predictive pipeline, and methods proposed for one specific bias often do not apply to another. As a consequence, much work has focused on bias effects and symptoms rather than their origins. While it is essential to address the effects of bias, doing so can leave the fundamental origin unchanged (Gonen and Goldberg, 2019), requiring researchers to rediscover the issue over and over. The "bias" discussed in one paper may, therefore, be quite different from that in another.[2] A shared definition and framework of predictive bias can unify these efforts, provide a common terminology, help to identify underlying causes, and allow coordination of countermeasures (Sun et al., 2019). However, such a general framework had yet to be proposed within the NLP community.

To address these problems, we suggest a joint conceptual framework, depicted in Figure 1, outlining and relating the different origins of bias. We base our framework on an extensive survey of the relevant NLP literature, informed by selected works in social science and adjacent fields. We identify four distinct sources of bias: selection bias, label bias, model overamplification, and semantic bias. We can express all of these as differences between (a) a "true" or intended distribution (e.g., over users, labels, or outcomes), and (b) the distribution used or produced by the model. These cases arise at specific points within a typical predictive pipeline: embeddings, source data, labels (human annotators), models, and target data. We provide quantitative definitions of predictive bias in this framework, intended to make it easier to: (a) identify biases (because they can be classified), (b) develop countermeasures (because the underlying problem is known), and (c) compare biases and countermeasures across papers. We hope this paper will help researchers spot, compare, and address bias in all its various forms.

[Figure 1: The predictive bias framework, spanning the source population (model side), target population (application side), and embedding corpus (pre-trained side). Origins: label bias (biased annotations, interaction, or latent bias from past classifications), selection bias (the sample of observations is not representative of the application population), over-amplification (the model discriminates on a given human attribute beyond its source base rate), and semantic bias (non-ideal associations between attributed lexemes, e.g., gendered pronouns, and non-attributed lexemes, e.g., occupations). Consequences: outcome disparity (the distribution of outcomes, given attribute A, is dissimilar to the ideal distribution) and error disparity (the distribution of error ε over at least two values of an attribute A is unequal). The figure includes an error-disparity example over joint outcomes (Dep, is_older): ideal .30/.35/.20/.15 vs. predicted .25/.40/.25/.20.]

[1] An even more extensive body of work on fairness exists as part of the FAT* conferences, which goes beyond the scope of this bias-focused paper. Note also that while bias is an ethical issue and contributes to many papers in the ethics in NLP area, the two should not be conflated: ethics covers more than bias.

[2] Quantitative social science offers a background for bias (Berk, 1983). However, NLP differs fundamentally in analytic goals (namely, out-of-sample prediction for NLP versus parameter inference for hypothesis testing in social science), which bring about NLP-specific situations: biases in word embeddings, annotator labels, or predicting over-amplified demographics.
Contributions Our primary contributions include: (1) a conceptual framework for identifying and quantifying predictive bias and its origins within a standard NLP pipeline, (2) a survey of biases identified in NLP models, and (3) a survey of methods for countering bias in NLP organized within our conceptual framework.

Definition: Two Types of Disparities
Our definition of predictive bias in NLP builds on its definition within the literature on standardized testing (i.e., SAT, GRE, etc.). Specifically, Swinton (1981) states: By "predictive bias," we refer to a situation in which a [predictive model] is used to predict a specific criterion for a particular population, and is found to give systematically different predictions for subgroups of this population who are in fact identical on that specific criterion.[3] We generalize Swinton's definition in two ways. First, to align notation with standard supervised modeling, we say there are both Y (a random variable representing the "true" values of an outcome) and Ŷ (a random variable representing the predictions). Next, we allow the concept to apply to differences associated with continuously-valued human attributes rather than only discrete subgroups of people.[4] Below, we define two types of measurable systematic differences (i.e., "disparities"): (1) a systematic difference between Y and Ŷ (outcome disparity), and (2) a difference in error, ε = |Y − Ŷ| (error disparity), both as a function of a given human attribute, A.
Outcome disparity. Formally, we say an outcome disparity exists for an outcome Y, a domain D (with values source or target), and an attribute A when the distribution of the predicted outcome, Q(Ŷ_D | A_D), is dissimilar to a given theoretical ideal distribution, P(Y_D | A_D):

Q(Ŷ_D | A_D) ≁ P(Y_D | A_D)

The ideal distribution is specific to the target application. Our framework allows researchers to use their own criteria to determine this distribution. However, the task of doing so may be nontrivial. First, the current distribution within a population may not be accessible. Even when it is, it may not be what most consider the ideal distribution (e.g., the distribution of gender in computer science and the associated disparity of NLP models attributing male pronouns to computer scientists more frequently (Hovy, 2015)). Second, it may be difficult to come to an agreed-upon ideal distribution from a moral or ethical perspective. In such a case, it may be helpful to use an ideal "direction", rather than a specific distribution (e.g., moving toward a uniform distribution of pronouns associated with computer science). Our framework should enable its users to apply evolving standards and norms across NLP's many application contexts.
A prototypical example of outcome disparity is gender disparity in image captions. Zhao et al. (2017) and Hendricks et al. (2018) demonstrate a systematic difference with respect to gender in the outcome of the model, Ŷ, even when taking the source distribution as the ideal target distribution:

Q(Ŷ_target | gender) ≁ Q(Y_target | gender) ∼ Q(Y_source | gender)

As a result, captions overpredict females in images with ovens and males in images with snowboards.
Error disparity. We say there is an error disparity when model predictions have a larger error for individuals with a given user attribute (or range of attributes, in the case of continuously-valued attributes). Formally, the error of a predicted distribution is

ε_D = |Y_D − Ŷ_D|

If there is a difference in ε_D over at least two different values of an attribute A (assuming they have been adequately sampled to establish a distribution of ε_D), then there is an error disparity:

Q(ε_D | A_D = a_i) ≁ Q(ε_D | A_D = a_j)

In other words, the error for one group might systematically differ from the error for another group, e.g., the error for green people differs from the error for blue people. Under unbiased conditions, the two error distributions would be equal. This formulation allows us to capture both the discrete case (arguably more common in NLP, for example, in POS tagging) and the continuous case (for example, in age or income prediction).
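Both disparity checks can be sketched in a few lines of code (the data, attribute values, and ideal rate below are all hypothetical): per-group predicted-positive rates are compared against an ideal rate, and per-group mean absolute errors against one another.

```python
from collections import defaultdict

def disparities(y_true, y_pred, attr, ideal_rate):
    """Per-group predicted-positive rate (outcome disparity check)
    and mean absolute error (error disparity check)."""
    rates, errors = defaultdict(list), defaultdict(list)
    for y, yh, a in zip(y_true, y_pred, attr):
        rates[a].append(yh)
        errors[a].append(abs(y - yh))
    return {a: {"pred_rate": sum(rates[a]) / len(rates[a]),
                "rate_gap": sum(rates[a]) / len(rates[a]) - ideal_rate,
                "mean_error": sum(errors[a]) / len(errors[a])}
            for a in rates}

# Hypothetical binary task with attribute A in {"old", "young"}.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
attr   = ["old"] * 4 + ["young"] * 4
report = disparities(y_true, y_pred, attr, ideal_rate=0.5)
# "old" is over-predicted positive (rate 0.75), while "young" carries
# twice the error (0.5 vs. 0.25): both disparities at once.
```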
We propose that if either of these two disparities exists in the target application, then there is a predictive bias. Note that predictive bias is then a property of a model given a specific application, rather than an intrinsic property of the model by itself. This definition mirrors predictive bias in standardized testing (Swinton, 1981): "a [predictive model] cannot be called biased without reference to a specific prediction situation; thus, the same instrument may be biased in one application, but unbiased in another." A prototypical example of error disparity is the "Wall Street Journal effect": a systematic difference in error as a function of demographics, first documented by Hovy and Søgaard (2015). In theory, POS tagging errors increase the further an author's demographic attributes differ from those of the average WSJ author of the 1980s and 1990s (on whom many POS taggers were trained; a selection bias, discussed next). Work by Sap et al. (2019) shows error disparity from a different origin, namely unfairness in hate speech detection. They find that annotators for hate speech on social media make more mistakes on posts by black individuals. Contrary to the case above, the disparity is not necessarily due to a difference between author and annotator populations (a selection bias). Instead, the label disparity stems from annotators failing to account for the authors' racial background and sociolinguistic norms.

Source and Target Populations. An important assumption of our framework is that disparities depend on the population to which the model is applied. This assumption is reflected in distinguishing a separate "target population" from the "source population" on which the model was trained. In cross-validation over random folds, models are trained and tested on the same population. However, in practice, models are often applied to novel data that may originate from a different population of people. In other words, the disparity may exist as a model property for one application, but not for another.
Quantifying disparity. Given the definitions of the two types of disparities, we can quantify bias with well-established measures of distributional divergence or deviance. Specifically, we suggest the log-likelihood ratio as a central metric:

LLR = log [ p(Ŷ | A) / p(Y | A) ]

where p(Y | A) is the specified ideal distribution (either derived empirically or theoretically) and p(Ŷ | A) is the distribution within the data. For error disparity, the ideal distribution is always the uniform distribution, and Ŷ is replaced with the error ε. KL divergence, D_KL[P(Ŷ | A) ‖ P(Y | A)], can be used as a secondary, more scalable alternative.
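The KL variant reduces to a few lines when both distributions are available as categorical tables (the pronoun distributions below are hypothetical):

```python
import math

def kl_divergence(p_pred, p_ideal):
    """D_KL[P(Y_hat|A) || P(Y|A)] for two categorical distributions,
    given as dicts mapping outcome -> probability."""
    return sum(p * math.log(p / p_ideal[y])
               for y, p in p_pred.items() if p > 0)

# Pronoun distribution in "doctor" contexts:
ideal     = {"he": 0.5, "she": 0.5}   # chosen ideal distribution
predicted = {"he": 0.8, "she": 0.2}   # what the model produces
bias_score = kl_divergence(predicted, ideal)  # 0 iff the distributions match
```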
Our measure above attempts to synthesize metrics others have used in work focused on specific biases. For example, the definition of outcome disparity is analogous to that used for semantic bias. Kurita et al. (2019) quantify bias in embeddings as the difference in log probability score when replacing words suspected to carry semantic differences ("he", "she") with a mask: NOUN is replaced with a specific noun to check for semantic bias (e.g., an occupation), and PRON is an associated demographic word (e.g., "he" or "she").

Four Origins of Bias
But what leads to an outcome disparity or error disparity? We identify four points within the standard supervised NLP pipeline where bias may originate: (1) the training labels (label bias), (2) the samples used as observations for training or testing (selection bias), (3) the representation of the data (semantic bias), and (4) the fit method itself (overamplification).
Label bias. Label bias emerges when the distribution of the dependent variable in the data source diverges substantially from the ideal distribution:

Q(Y_source | A_source) ≁ P(Y_source | A_source)

Here, the labels themselves are erroneous with respect to the demographic attribute of interest (as compared to the source distribution). Sometimes, this bias is due to a non-representative group of annotators (Joseph et al., 2017). In other cases, it may be due to a lack of domain expertise (Plank et al., 2014), or to preconceived notions and stereotypes held by the annotators (Sap et al., 2019).
Selection bias. Selection bias emerges from non-representative observations, i.e., when the users generating the training (source) observations differ in distribution from the users of the target, where the model will be applied. Selection bias (sometimes also referred to as sample bias) has long been a concern in the social sciences, where testing for it is by now a fundamental consideration in study design (Berk, 1983; Culotta, 2014).
Within NLP, some of the first works to note demographic biases were due to a selection bias (Hovy and Søgaard, 2015;Jørgensen et al., 2015).
A prominent example is the so-called "Wall Street Journal effect", where syntactic parsers and part-of-speech taggers are most accurate on language written by middle-aged white men. The effect occurs because this group happened to be the predominant author demographic of the WSJ articles traditionally used to train syntactic models (Garimella et al., 2019). The same effect was reported for language identification difficulties for African-American Vernacular English (Blodgett and O'Connor, 2017; Jurgens et al., 2017).
The predicted output is dissimilar to the ideal distribution, leading, for example, to lower accuracy for a given demographic, since the source did not reflect the ideal distribution. We say that the distribution of a human attribute A within the source data s is dissimilar to the distribution of A within the target data t:

Q(A_s) ≁ P(A_t)

Selection bias has several peculiarities. First, it depends on the ideal distribution of the target population, so a model may have selection bias for one application (and its associated target population), but not for another. Also, consider that either the source features (X_s) or source labels (Y_s) may be non-representative. In many situations, the distributions for the features and labels are the same. However, there are cases where they diverge, for example, when using features from age-biased tweets, but labels from non-biased census surveys. In such cases, we need to take multiple levels of analysis into account: corrections can be applied to user features as they are aggregated to communities (Almodaresi et al., 2017). The consequences can be both outcome and error disparity.
One of the challenges in addressing selection bias is that we cannot know a priori which (demographic) attributes will be important to control for. Age and gender are well studied, but others might be less obvious. We might someday realize that a formerly innocuous attribute (say, handedness) turns out to be relevant for selection biases. This problem is a version of the famous "known and unknown unknowns":
As we know, there are known knowns: there are things we know we know. We also know there are known unknowns: that is to say, we know there are some things we do not know. But there are also unknown unknowns: the ones we don't know we don't know.
[Table 1: Conditions for the different bias origins: depending on whether the sample and the labels are representative, a data set exhibits both selection and label bias, selection bias only, label bias only, or no bias.]

We will see later how better documentation can help future researchers address this problem.
Overamplification. Another source of bias can occur even when there is no label or selection bias. In overamplification, a model relies on a small difference between human attributes with respect to the objective (even an acceptable difference matching the ideal distribution), but amplifies this difference to be much more pronounced in the predicted outcomes. Overamplification originates during learning itself: the model learns to pick up on imperfect evidence for the outcome, which brings out the bias. Formally, in overamplification the predicted distribution Q(Ŷ_s | A_s) is dissimilar to the source training distribution Q(Y_s | A_s) with respect to a human attribute A, and therefore also dissimilar to the target ideal distribution:

Q(Ŷ_s | A_s) ≁ Q(Y_s | A_s) ∼ P(Y_t | A_t)

For example, Yatskar et al. (2016) found that in the imSitu image captioning data set, 58% of captions involving a person in a kitchen mention women. However, standard models trained on such data end up predicting people depicted in kitchens as women 63% of the time (Zhao et al., 2017). In other words, when generating a gender reference within the text (e.g., "A [woman|man] standing next to a counter-top"), the model makes an incorrect female reference much more common.
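The amplification mechanic is easy to reproduce with a deliberately minimal model (the 58/42 split mirrors the imSitu numbers, but the classifier here is an illustrative maximum-a-posteriori rule, not the models of Zhao et al.): given only the scene as a feature, the MAP decision always picks the majority gender, turning a 58% base rate into a 100% prediction rate.

```python
from collections import Counter

def fit_map_classifier(features, labels):
    """Predict, for each feature value, the most frequent label seen
    with it in training: the MAP rule of a bare conditional model."""
    counts = {}
    for x, y in zip(features, labels):
        counts.setdefault(x, Counter())[y] += 1
    return {x: c.most_common(1)[0][0] for x, c in counts.items()}

# 58% of "kitchen" training captions mention a woman.
train_x = ["kitchen"] * 100
train_y = ["woman"] * 58 + ["man"] * 42

model = fit_map_classifier(train_x, train_y)
preds = [model[x] for x in train_x]
pred_rate = preds.count("woman") / len(preds)  # 1.0: 58% amplified to 100%
```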
The occurrence of overamplification in the absence of other biases is an important motivation for countermeasures: it does not require bias on the part of the annotator, data collector, or even the programmer/data analyst (though it can escalate existing biases and the model's statistical discrimination along a demographic dimension). In particular, it pushes back against the argument that countermeasures are merely cosmetic and do not address the underlying cause, biased language in society (Gonen and Goldberg, 2019): since overamplification arises during model fitting, model-side countermeasures target a genuine origin of bias.
Semantic bias. Embeddings (i.e., vectors representing the meaning of words or phrases) have become a mainstay of modern NLP, providing more flexible representations that feed both traditional and deep learning models. However, these representations often contain unintended or undesirable associations and societal stereotypes (e.g., connecting medical doctors more frequently to male pronouns than female pronouns, see Bolukbasi et al. (2016); Caliskan et al. (2017)). We adopt the term used for this phenomenon by others, "semantic bias".
Formally, we attribute semantic bias to the parameters of the embedding model (θ_emb). Semantic bias is a unique case since it indirectly affects both outcome disparity and error disparity by creating other biases, such as overamplification (Yatskar et al., 2016; Zhao et al., 2017) or diverging word associations within embeddings or language models (Bolukbasi et al., 2016; Rudinger et al., 2018). However, we distinguish it from the other biases, since the population does not have to be people, but rather words in contexts that yield non-ideal associations. For example, the issue is not (only) that a particular gender authors more of the training data for the embeddings; rather, it is that gendered pronouns are mentioned alongside occupations according to a non-ideal distribution (e.g., texts talk more about male doctors and female nurses than vice versa). Furthermore, pretrained embeddings are often used without access to the original data (or the resources to process it). We thus suggest that embedding models themselves are a distinct source of bias within NLP predictive pipelines.
Embeddings have consequently received increased attention, with dedicated sessions at NAACL and ACL 2019. As an example, Kurita et al. (2019) quantify human-like bias in BERT. Using the Gender Pronoun Resolution (GPR) task, they find that, even after balancing the data set, the model predicts no female pronouns with high probability. Semantic bias is also of broad interest to the social sciences as a diagnostic tool (see Section A). However, its inclusion in our framework is not for reasons of social-scientific diagnostics, but rather to guide mindful researchers where to look for problems.
Multiple Biases. Biases do not only occur in isolation; they can also compound and increase each other's effects. Label and selection bias can, and often do, interact, so it can be challenging to distinguish them. Table 1 shows the conditions that delineate one from the other.
Consider the case where a researcher chooses to balance a sentiment data set for a user attribute, e.g., age. This decision can directly impact the label distribution of the target variable. E.g., because the positive label is over-represented in a minority age group. Models learn to exploit this confounding correlation between age and label prevalence and magnify it even more. The resulting model may be useless, as it only captures the distribution in the synthetic data sample. We see this situation in early work on using social media data to predict mental health conditions. Models to distinguish PTSD from depression turned out to mainly capture the differences in user age and gender, rather than language reflecting the actual conditions (Preoţiuc-Pietro et al., 2015).

Other Bias Definitions and Frameworks
While this is the first attempt at a comprehensive conceptual framework for bias in NLP, alternative frameworks exist, both in other fields and based on more qualitative definitions. Friedler et al. (2016) define bias as unfairness in algorithms. They specify the idea of a "construct" space, which captures the latent features in the data that help predict the right outcomes. They suggest that finding those latent variables would also enable us to produce the right outcomes. Hovy and Spruit (2016) take a broader view of bias, based on ethics in new technologies. They list three qualitative sources (data, modeling, and research design), and suggest three corresponding types of biases: demographic bias, overgeneralization, and topic exposure. Suresh and Guttag (2019) propose a qualitative framework for bias in machine learning, defining bias as a "potential harmful property of the data". They categorize bias into historical bias, representation bias, measurement bias, and evaluation bias. Glymour and Herington (2019) classify algorithmic bias, in general, into four categories, depending on the causal conditional dependencies to which it is sensitive: procedural bias, outcome bias, behavior-relative error bias, and score-relative error bias. Corbett-Davies and Goel (2018) point out statistical limitations of the three prominent definitions of fairness (anti-classification, classification parity, and calibration), enabling researchers to develop fairer machine learning algorithms.
Our framework focuses on NLP, but it follows Glymour and Herington (2019) in providing probability-based definitions of bias. It incorporates and formalizes the above to varying degrees.
In the social sciences, bias definitions often relate to the ability to test causal hypotheses. Hernán et al. (2004) propose a common structure for various types of selection bias. They define bias as a discrepancy between the association of a variable with the outcome and the causal effect of that variable on the outcome, e.g., when the causal risk ratio (CRR) differs from the associational risk ratio (ARR). Similarly, Baker et al. (2013) define bias in terms of uncontrolled covariates or "disturbing variables" that are related to the measures of interest.
Others provide definitions restricted to particular applications, for example, Caliskan et al. (2017) for associations in word embeddings. Our framework encompasses bias-related work in the social sciences; see the supplement in A.1 for a brief overview.

Countermeasures
We group proposed countermeasures based on the origin(s) on which they act.
Label Bias. There are several ways to address label bias, typically by controlling for biases of the annotators (Pavlick et al., 2014). Disagreement between annotators has long been an active research area in NLP, with various approaches to measure and quantify disagreement through interannotator agreement (IAA) scores to remove outliers (Artstein and Poesio, 2008). Lately, there has been more of an emphasis on embracing variation through the use of Bayesian annotation models (Hovy et al., 2013;Passonneau and Carpenter, 2014;Paun et al., 2018). These models arrive at a much less biased estimate for the final label than majority voting, by attaching confidence scores to each annotator, and reweighting them through that method. Other approaches have explored harnessing the inherent disagreement among annotators to guide the training process (Plank et al., 2014). By weighting updates by the amount of disagreement on the labels, this method prevents bias towards any one label. The weighted updates act as a regularizer during training, which might also help prevent overamplification. If annotators behave in predictable ways to produce artifacts (i.e., always add "not" to form a contradiction), we can train a model on such biased features and use it in ensemble learning (Clark et al., 2019). Hays et al. (2015) attempt to make Web studies equivalent to representative focus group panels. They give an overview of probabilistic and non-probabilistic approaches to generate the Internet panels that contribute to the data generation. Along with the six demographic attributes (age, gender, race/ethnicity, education, marital status, and income), they use poststratification to reduce the bias (some of these methods cross into addressing selection bias).
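The intuition behind such annotator reweighting can be sketched with a toy single-iteration scheme (hypothetical annotations; the Bayesian annotation models cited above are considerably more sophisticated): score each annotator by agreement with an initial majority vote, then re-vote with those scores as weights.

```python
def weighted_labels(annotations):
    """annotations: {item: {annotator: label}}. One iteration of
    confidence-weighted voting: majority vote -> annotator agreement
    scores -> re-vote weighted by those scores."""
    def vote(weights):
        result = {}
        for item, labs in annotations.items():
            scores = {}
            for ann, lab in labs.items():
                scores[lab] = scores.get(lab, 0.0) + weights.get(ann, 1.0)
            result[item] = max(scores, key=scores.get)
        return result

    majority = vote({})  # step 1: unweighted majority vote
    conf = {}            # step 2: agreement with the majority
    for item, labs in annotations.items():
        for ann, lab in labs.items():
            hits, total = conf.get(ann, (0, 0))
            conf[ann] = (hits + (lab == majority[item]), total + 1)
    weights = {ann: hits / total for ann, (hits, total) in conf.items()}
    return vote(weights), weights  # step 3: confidence-weighted re-vote

annotations = {
    "s1": {"a1": "pos", "a2": "pos", "a3": "neg"},
    "s2": {"a1": "neg", "a2": "neg", "a3": "pos"},
    "s3": {"a1": "pos", "a2": "neg", "a3": "neg"},
}
labels, weights = weighted_labels(annotations)
# a3 disagrees with the majority twice and is down-weighted accordingly.
```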
Selection bias. The primary source for selection bias is the mismatch between the sample distribution and the ideal distribution. Consequently, any countermeasures need to re-align the two distributions to minimize this mismatch.
The easiest way to address the mismatch is to re-stratify the data to more closely match the ideal distribution. However, this often involves downsampling an overly represented class, which reduces the number of available instances. Mohammady and Culotta (2014) use a stratified sampling technique to reduce the selection bias in the data. Almeida et al. (2015) use demographic user attributes, including age, gender, and social status, to predict the election results in six different cities of Brazil. They use stratified sampling on all the resulting groups to reduce selection bias.
Rather than re-sampling, others use reweighting or poststratification to reduce selection bias. Culotta (2014) estimates county-level health statistics based on social media data. He shows we can stratify based on external socio-demographic data about a community's composition (e.g., gender and race). Park et al. (2006) estimate state-wise public opinion using the National Surveys corpus. To reduce bias, they use various socioeconomic and demographic attributes (state of residence, sex, ethnicity, age, and education level) in a multilevel logistic regression. Choy et al. (2011) and Choy et al. (2012) also use race and gender as features for reweighting in predicting the results of the Singapore and US presidential elections. Baker et al. (2013) study how selection bias manifests in inferences for a larger population, and how to avoid it. Apart from basic demographic attributes, they also consider attitudinal and behavioral attributes. They suggest using reweighting, ranking reweighting or propensity score adjustment, and sample-matching techniques to reduce selection bias.
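A minimal poststratification sketch (with hypothetical group shares): each observation is weighted by the ratio of its group's target share to its sample share, so the weighted estimate reflects the target population rather than the skewed sample.

```python
def poststratify(values, attrs, target_dist):
    """Weighted mean of `values`, reweighting each attribute group
    from its sample share to its share in `target_dist`."""
    n = len(values)
    sample_dist = {a: attrs.count(a) / n for a in set(attrs)}
    weights = [target_dist[a] / sample_dist[a] for a in attrs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Estimate a support rate from an age-skewed sample (50/50 young/old)
# when the target population is 30% young / 70% old.
values = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]   # 1 = supports
attrs  = ["young"] * 5 + ["old"] * 5
estimate = poststratify(values, attrs, {"young": 0.3, "old": 0.7})
# Unweighted mean is 0.40; the poststratified estimate is 0.32.
```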
Others have suggested combinations of these approaches. Hernán et al. (2004) propose directed acyclic graphs for various heterogeneous types of selection bias, and suggest using stratified sampling, regression adjustment, or inverse probability weighting to avoid bias in the data. Zagheni and Weber (2015) study the use of Internet data for demographic studies and propose two approaches to reduce selection bias. If ground truth is available, they adjust for selection bias by calibrating a stochastic microsimulation. If it is unavailable, they suggest using a difference-in-differences technique to identify trends on the Web. Zmigrod et al. (2019) show that gender-based selection bias can be addressed by data augmentation, i.e., by adding slightly altered examples to the data. This addition addresses selection bias originating in the features (X_source), so that the model is fit on a more gender-representative sample. Their approach is similar to the reweighting of poll data based on demographics, which can be applied more directly to tweet-based population surveillance (see our last case study, A.2).
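A toy version of the augmentation idea (the swap table is a small hypothetical fragment; the actual method of Zmigrod et al. also handles case and morphological agreement, which a flat lookup cannot):

```python
SWAPS = {"he": "she", "she": "he", "him": "her", "her": "him",
         "man": "woman", "woman": "man", "mr.": "ms.", "ms.": "mr."}

def gender_swap(sentence):
    """Swap gendered tokens to produce a counterfactual sentence."""
    return " ".join(SWAPS.get(tok, tok) for tok in sentence.lower().split())

def augment(corpus):
    """Append the gender-swapped counterpart of every sentence."""
    return corpus + [gender_swap(s) for s in corpus]

corpus = ["she is a doctor", "he fixed the engine"]
balanced = augment(corpus)
# balanced now also contains "he is a doctor" and "she fixed the engine".
```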
Li et al. (2018) introduce a model-based countermeasure. They use an adversarial multi-task learning setup to explicitly model demographic attributes as auxiliary tasks. By reversing the gradient for those tasks during backpropagation, they effectively force the model to ignore confounding signals associated with the demographic attributes. Apart from improving overall performance across demographics, they show that this setup also protects user privacy. The findings of Elazar and Goldberg (2018), however, suggest that even with adversarial training, internal representations still retain traces of demographic information.
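The core of the gradient-reversal trick can be sketched in numpy (a one-layer "encoder" with squared-error heads and synthetic data; real implementations such as Li et al.'s insert a reversal layer into backpropagation rather than combining gradients by hand): the shared parameters descend the task loss while ascending the adversary's.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))        # inputs
y_task = rng.normal(size=(8,))     # main-task targets
y_demo = rng.normal(size=(8,))     # demographic targets (adversary)

W = rng.normal(size=(4,))          # shared "encoder" (one linear layer)
u, v = 1.0, 1.0                    # task head / adversary head weights
lam = 0.5                          # reversal strength

h = x @ W                          # shared representation
# Analytic gradients of the two mean-squared-error losses w.r.t. W:
g_task = x.T @ (2 * (u * h - y_task) * u) / len(x)
g_adv  = x.T @ (2 * (v * h - y_demo) * v) / len(x)

# Gradient reversal: descend the task loss, ASCEND the adversary loss,
# pushing demographic signal out of the shared representation.
g_shared = g_task - lam * g_adv
W = W - 0.1 * g_shared
```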
Overamplification. In its simplest form, overamplification of inherent bias by the model can be corrected by downweighting the biased instances in the sample, to discourage the model from exaggerating the effects.
A common approach involves synthetically matched distributions. To address gender bias in neural network approaches to coreference resolution, Rudinger et al. (2018) and Zhao et al. (2018) construct gender-swapped variants of the data, so that male and female entities occur in matched distributions and the model cannot exaggerate gendered cues.
Semantic bias. Countermeasures for semantic bias in embeddings typically adjust the parameters of the embedding model to reflect a target distribution more accurately. Because all of the above techniques can also be applied during model fitting, here we highlight techniques specific to addressing bias in embeddings. Bolukbasi et al. (2016) classify embedding de-biasing techniques into two approaches: hard de-biasing (which removes bias completely) and soft de-biasing (which removes bias partially, avoiding side effects). Romanov et al. (2019) generalize this work to a multi-class setting, exploring methods to mitigate bias in an occupation classification task. By reducing the correlation between people's occupations and the word embeddings of their names, they reduce race and gender biases simultaneously without reducing the classifier's performance. Manzini et al. (2019) identify the bias subspace using principal component analysis and remove the biased components with the hard (Neutralize and Equalize) and soft de-biasing methods proposed by Bolukbasi et al. (2016). The above examples evaluate success via the semantic analogy task (Mikolov et al., 2013), though the informativeness of this method has since been questioned (Nissim et al., 2019). For a dedicated overview of semantic de-biasing techniques, see Lauscher et al. (2020).
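The core of the hard "Neutralize" step can be sketched as a projection: each embedding is stripped of its component along a bias direction. In Bolukbasi et al. (2016) that direction comes from a PCA over many definitional word pairs; the toy vectors below use a single pair and are invented for illustration.

```python
import numpy as np

# Sketch of the "Neutralize" step of hard de-biasing (Bolukbasi et al., 2016):
# project an embedding onto the complement of a bias direction. Real systems
# derive the direction via PCA over many definitional pairs; here we use one
# toy pair, and all vectors are made up.

def neutralize(v, bias_dir):
    """Remove the component of v that lies along bias_dir."""
    b = bias_dir / np.linalg.norm(bias_dir)
    return v - np.dot(v, b) * b

he  = np.array([1.0, 0.2, 0.0])
she = np.array([-1.0, 0.2, 0.0])
g = he - she                         # toy gender direction

doctor = np.array([0.4, 0.9, 0.3])
doctor_db = neutralize(doctor, g)
# doctor_db has zero component along g; its other dimensions are untouched
```

The "Equalize" step (not shown) additionally makes word pairs like "he"/"she" equidistant from every neutralized word, so that residual asymmetries do not reintroduce bias.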
Social-Level Mitigation. Several initiatives propose standardized documentation to trace potential biases and, ultimately, to mitigate them. Data statements (Bender and Friedman, 2018) disclose data selection, annotation, and curation processes explicitly and transparently. Similarly, Gebru et al. (2018) suggest datasheets covering the lifecycle of data, including "motivation for dataset creation; dataset composition; data collection process; data preprocessing; dataset distribution; dataset maintenance; and legal and ethical considerations". Mitchell et al. (2019) extend this idea to include model specifications and performance details for different user groups. Hitti et al. (2019) propose a taxonomy for assessing the gender bias of a dataset. While these steps do not directly mitigate bias, they encourage researchers to identify and communicate sources of label or selection bias. Such documentation, combined with a conceptual framework to guide specific mitigation techniques, acts as an essential mitigation tool at the level of the research community. See Appendix A.2 for case studies outlining various types of bias in several NLP tasks.

Conclusion
We present a comprehensive overview of the recent literature on predictive bias in NLP. Based on this survey, we develop a unifying conceptual framework that describes bias in terms of its origins, rather than just its effects. This framework allows us to group and compare work on countermeasures. Rather than giving the impression that bias is a growing problem, we would like to point out that bias is not necessarily something gone awry, but something nearly inevitable in statistical models. We do stress, however, that we need to acknowledge and address bias with proactive measures, and a formal framework of its causes can help us do so.
We would like to leave the reader with these main points: (1) every predictive model with errors is bound to have disparities over human attributes, even if it does not directly integrate human attributes; (2) disparities can result from a variety of origins within the standard predictive pipeline: the embedding model, the feature sample, the fitting process, and the outcome sample; (3) selecting "protected attributes" (human attributes along which to avoid biases) is necessary for measuring bias, and often helpful for mitigating bias and increasing the generalization ability of the models.
We see this paper as a step toward a unified understanding of bias in NLP. We hope it inspires further work on identifying and countering bias, as well as on conceptually and mathematically defining bias in NLP.

Framework Application Steps (TL;DR)
1. Specify the target population and an ideal distribution of the attribute (A) to be investigated for bias; consult datasheets and data statements, if available, for the model source.
2. If there is an outcome disparity or error disparity, check for potential origins:
(a) If label bias: use post-stratification or retrain annotators.
(b) If selection bias: use stratified sampling to match the source to the target population, or use post-stratification or re-weighting techniques.
(c) If overamplification: synthetically match distributions or add the outcome disparity to the cost function.
(d) If semantic bias: retrain or retrofit embeddings considering the approaches above, but with attributed (e.g., gendered) words (rather than people) as the population.
In our last case study (A.2), the source features do not match the target distribution, while the source outcomes do match the target outcomes (Q(X_source | A) ≁ P(X_target | A), but Q(Y_source | A) ∼ P(Y_target | A)).
In this case, effective countermeasures against selection and semantic biases (for X_source and X_target) should increase predictive performance against a representative community outcome. Indeed, Giorgi et al. (2019) adjust the feature estimates, X, to match representative demographics and socio-economics using inferred user attributes, and find improved predictions of the life satisfaction of a Twitter community.
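The disparity check in step 2 of the application steps can be sketched as a comparison of per-group statistics: an outcome disparity shows up as differing predicted-positive rates across values of the protected attribute A, an error disparity as differing error rates. The labels, predictions, and groups below are a tiny hypothetical example (the framework's definitions compare full distributions, not just means).

```python
# Sketch of the outcome/error disparity check from step 2 of the framework
# application steps. All data here are a hypothetical toy example.

def group_means(values, groups):
    """Mean of `values` within each group of the protected attribute."""
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    return {g: sum(vs) / len(vs) for g, vs in by_group.items()}

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]
A      = ["x", "x", "x", "x", "y", "y", "y", "y"]  # protected attribute

outcome = group_means(y_pred, A)   # predicted-positive rate per group
errors = group_means([int(p != t) for p, t in zip(y_pred, y_true)], A)

outcome_disparity = abs(outcome["x"] - outcome["y"])  # 0.50 vs 0.75
error_disparity = abs(errors["x"] - errors["y"])      # 0.25 vs 0.50
# nonzero disparities would trigger the origin checks in steps 2(a)-(d)
```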