Language (Technology) is Power: A Critical Survey of “Bias” in NLP

We survey 146 papers analyzing “bias” in NLP systems, finding that their motivations are often vague, inconsistent, and lacking in normative reasoning, despite the fact that analyzing “bias” is an inherently normative process. We further find that these papers’ proposed quantitative techniques for measuring or mitigating “bias” are poorly matched to their motivations and do not engage with the relevant literature outside of NLP. Based on these findings, we describe the beginnings of a path forward by proposing three recommendations that should guide work analyzing “bias” in NLP systems. These recommendations rest on a greater recognition of the relationships between language and social hierarchies, encouraging researchers and practitioners to articulate their conceptualizations of “bias” (i.e., what kinds of system behaviors are harmful, in what ways, to whom, and why, as well as the normative reasoning underlying these statements) and to center work around the lived experiences of members of communities affected by NLP systems, while interrogating and reimagining the power relations between technologists and such communities.


Introduction
A large body of work analyzing "bias" in natural language processing (NLP) systems has emerged in recent years, including work on "bias" in embedding spaces (e.g., Bolukbasi et al., 2016a; Caliskan et al., 2017; Gonen and Goldberg, 2019), as well as work on "bias" in systems developed for a breadth of tasks including language modeling (Bordia and Bowman, 2019), coreference resolution (Zhao et al., 2018a), machine translation (Vanmassenhove et al., 2018), sentiment analysis, and hate speech/toxicity detection, among others. Although these papers have laid vital groundwork by illustrating some of the ways that NLP systems can be harmful, the majority of them fail to engage critically with what constitutes "bias" in the first place. Despite the fact that analyzing "bias" is an inherently normative process, in which some system behaviors are deemed good and others harmful, papers on "bias" in NLP systems are rife with unstated assumptions about what kinds of system behaviors are harmful, in what ways, to whom, and why. Indeed, the term "bias" (or "gender bias" or "racial bias") is used to describe a wide range of system behaviors, even though they may be harmful in different ways, to different groups, or for different reasons. Even papers analyzing "bias" in NLP systems developed for the same task often conceptualize it differently.
For example, the following system behaviors are all understood to be self-evident statements of "racial bias": (a) embedding spaces in which embeddings for names associated with African Americans are closer (compared to names associated with European Americans) to unpleasant words than pleasant words (Caliskan et al., 2017); (b) sentiment analysis systems yielding different intensity scores for sentences containing names associated with African Americans and sentences containing names associated with European Americans; and (c) toxicity detection systems scoring tweets containing features associated with African-American English as more offensive than tweets without these features. Moreover, some of these papers focus on "racial bias" expressed in written text, while others focus on "racial bias" against authors. This use of imprecise terminology obscures these important differences.
We survey 146 papers analyzing "bias" in NLP systems, finding that their motivations are often vague and inconsistent. Many lack any normative reasoning for why the system behaviors that are described as "bias" are harmful, in what ways, and to whom. Moreover, the vast majority of these papers do not engage with the relevant literature outside of NLP to ground normative concerns when proposing quantitative techniques for measuring or mitigating "bias." As a result, we find that many of these techniques are poorly matched to their motivations, and are not comparable to one another.
We then describe the beginnings of a path forward by proposing three recommendations that should guide work analyzing "bias" in NLP systems. We argue that such work should examine the relationships between language and social hierarchies; we call on researchers and practitioners conducting such work to articulate their conceptualizations of "bias" in order to enable conversations about what kinds of system behaviors are harmful, in what ways, to whom, and why; and we recommend deeper engagements between technologists and communities affected by NLP systems. We also provide several concrete research questions that are implied by each of our recommendations.

Method
Our survey includes all papers known to us analyzing "bias" in NLP systems: 146 papers in total. We omitted papers about speech, restricting our survey to papers about written text only. To identify the 146 papers, we first searched the ACL Anthology (https://www.aclweb.org/anthology/) for all papers with the keywords "bias" or "fairness" that were made available prior to May 2020. We retained all papers about social "bias," and discarded all papers about other definitions of the keywords (e.g., hypothesis-only bias, inductive bias, media bias). We also discarded all papers using "bias" in NLP systems to measure social "bias" in text or the real world (e.g., Garg et al., 2018).
To ensure that we did not exclude any relevant papers without the keywords "bias" or "fairness," we also traversed the citation graph of our initial set of papers, retaining any papers analyzing "bias" in NLP systems that are cited by or cite the papers in our initial set. Finally, we manually inspected any papers analyzing "bias" in NLP systems from leading machine learning, human-computer interaction, and web conferences and workshops, such as ICML, NeurIPS, AIES, FAccT, CHI, and WWW, along with any relevant papers that were made available in the "Computation and Language" and "Computers and Society" categories on arXiv prior to May 2020, but found that they had already been identified via our traversal of the citation graph. We provide a list of all 146 papers in the appendix. In Table 1, we provide a breakdown of the NLP tasks covered by the papers. We note that counts do not sum to 146, because some papers cover multiple tasks. For example, a paper might test the efficacy of a technique for mitigating "bias" in embedding spaces in the context of sentiment analysis.
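The citation-graph step described above can be sketched as a small breadth-first expansion. This is only an illustration under stated assumptions: the function name `expand_by_citations`, the `cites` mapping, the `is_relevant` predicate (standing in for the manual judgment of whether a paper analyzes "bias" in NLP systems), and the `max_hops` parameter are all hypothetical, and the text does not say how many hops the traversal actually followed.

```python
from collections import deque


def expand_by_citations(seeds, cites, is_relevant, max_hops=1):
    """Grow a seed set with relevant papers linked by citations in either direction.

    seeds: iterable of paper ids found by keyword search (hypothetical ids).
    cites: dict mapping a paper id to the ids it cites.
    is_relevant: callable standing in for manual inspection of a paper.
    max_hops: assumed limit on traversal depth; not specified in the text.
    """
    # Build the reverse map so "is cited by" links can be followed too.
    cited_by = {}
    for src, targets in cites.items():
        for dst in targets:
            cited_by.setdefault(dst, set()).add(src)

    selected = set(seeds)
    queue = deque((paper, 0) for paper in seeds)
    while queue:
        paper, hops = queue.popleft()
        if hops >= max_hops:
            continue
        neighbors = set(cites.get(paper, ())) | cited_by.get(paper, set())
        for other in neighbors:
            if other not in selected and is_relevant(other):
                selected.add(other)
                queue.append((other, hops + 1))
    return selected


# Toy graph with made-up ids: A cites B and C, D cites A, B cites E.
toy_cites = {"A": ["B", "C"], "D": ["A"], "B": ["E"]}
toy_relevant = {"B", "D", "E"}
one_hop = expand_by_citations({"A"}, toy_cites, toy_relevant.__contains__)
two_hops = expand_by_citations({"A"}, toy_cites, toy_relevant.__contains__, max_hops=2)
print(sorted(one_hop), sorted(two_hops))
```

Following links in both directions matters here because a keyword search can miss both older papers that the seed set cites and newer papers that cite it.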
Once identified, we read each of the 146 papers with the goal of categorizing their motivations and their proposed quantitative techniques for measuring or mitigating "bias." We used a previously developed taxonomy of harms for this categorization, which differentiates between so-called allocational and representational harms (Barocas et al., 2017; Crawford, 2017). Allocational harms arise when an automated system allocates resources (e.g., credit) or opportunities (e.g., jobs) unfairly to different social groups; representational harms arise when a system (e.g., a search engine) represents some social groups in a less favorable light than others, demeans them, or fails to recognize their existence altogether. Adapting and extending this taxonomy, we categorized the 146 papers' motivations and techniques into the following categories:
- Allocational harms.
- Stereotyping that propagates negative generalizations about particular social groups.
- Differences in system performance for different social groups, language that misrepresents the distribution of different social groups in the population, or language that is denigrating to particular social groups.
- Questionable correlations between system behavior and features of language that are typically associated with particular social groups.
- Vague descriptions of "bias" (or "gender bias" or "racial bias") or no description at all.
- Surveys, frameworks, and meta-analyses.

Table 2: Number of papers in each category, counted separately by motivation and by technique.

Category                                    Motivation   Technique
Allocational harms                          30           4
Stereotyping                                50           58
Other representational harms                52           43
Questionable correlations                   47           42
Vague/unstated                              23           0
Surveys, frameworks, and meta-analyses      20           20
In Table 2 we provide counts for each of the six categories listed above. (We also provide a list of the papers that fall into each category in the appendix.) Again, we note that the counts do not sum to 146, because some papers state multiple motivations, propose multiple techniques, or propose a single technique for measuring or mitigating multiple harms. Table 3, which is in the appendix, contains examples of the papers' motivations and techniques across a range of different NLP tasks.

Findings
Categorizing the 146 papers' motivations and proposed quantitative techniques for measuring or mitigating "bias" into the six categories listed above enabled us to identify several commonalities, which we present below, along with illustrative quotes. We grouped several types of representational harms into two categories to reflect that the main point of differentiation between the 146 papers' motivations and proposed quantitative techniques for measuring or mitigating "bias" is whether or not they focus on stereotyping. Among the papers that do not focus on stereotyping, we found that most lack sufficiently clear motivations and techniques to reliably categorize them further.

Motivations
Papers state a wide range of motivations, multiple motivations, vague motivations, and sometimes no motivations at all. We found that the papers' motivations span all six categories, with several papers falling into each one. Appropriately, papers that provide surveys or frameworks for analyzing "bias" in NLP systems often state multiple motivations (e.g., Bender, 2019). However, as the examples in [...] These examples leave unstated what it might mean for an NLP system to "discriminate," what constitutes "systematic biases," or how NLP systems contribute to "social injustice" (itself undefined).
Papers' motivations sometimes include no normative reasoning. We found that some papers (32%) are not motivated by any apparent normative concerns, often focusing instead on concerns about system performance. For example, the first quote below includes normative reasoning (namely, that models should not use demographic information to make predictions), while the second focuses on learned correlations impairing system performance.
"In [text classification], models are expected to make predictions with the semantic information rather than with the demographic group identity information (e.g., 'gay', 'black') contained in the sentences."
"An over-prevalence of some gendered forms in the training data leads to translations with identifiable errors. Translations are better for sentences involving men and for sentences containing stereotypical gender roles."
Even when papers do state clear motivations, they are often unclear about why the system behaviors that are described as "bias" are harmful, in what ways, and to whom. We found that even papers with clear motivations often fail to explain what kinds of system behaviors are harmful, in what ways, to whom, and why. For example,
"Deploying these word embedding algorithms in practice, for example in automated translation systems or as hiring aids, runs the serious risk of perpetuating problematic biases in important societal contexts." - Brunet et al. (2019)
"[I]f the systems show discriminatory behaviors in the interactions, the user experience will be adversely affected."
These examples leave unstated what "problematic biases" or non-ideal user experiences might look like, how the system behaviors might result in these things, and who the relevant stakeholders or users might be. In contrast, we find that papers that provide surveys or frameworks for analyzing "bias" in NLP systems often name who is harmed, acknowledging that different social groups may experience these systems differently due to their different relationships with NLP systems or different social positions. For example, one such paper argues for a "deep understanding of the user groups [sic] characteristics, contexts, and interests" when designing conversational agents.
Papers about NLP systems developed for the same task often conceptualize "bias" differently. Even papers that cover the same NLP task often conceptualize "bias" in ways that differ substantially and are sometimes inconsistent. Rows 3 and 4 of Table 3 (in the appendix) contain machine translation papers with different conceptualizations of "bias," leading to different proposed techniques, while rows 5 and 6 contain papers on "bias" in embedding spaces that state different motivations, but propose techniques for quantifying stereotyping.
Papers' motivations conflate allocational and representational harms. We found that the papers' motivations sometimes (16%) name immediate representational harms, such as stereotyping, alongside more distant allocational harms, which, in the case of stereotyping, are usually imagined as downstream effects of stereotypes on résumé filtering. Many of these papers use the imagined downstream effects to justify focusing on particular system behaviors, even when the downstream effects are not measured. Papers on "bias" in embedding spaces are especially likely to do this because embeddings are often used as input to other systems:
"However, none of these papers [on embeddings] have recognized how blatantly sexist the embeddings are and hence risk introducing biases of various types into real-world systems." - Bolukbasi et al. (2016a)
"It is essential to quantify and mitigate gender bias in these embeddings to avoid them from affecting downstream applications."
In contrast, papers that provide surveys or frameworks for analyzing "bias" in NLP systems treat representational harms as harmful in their own right. For example, two such papers cite the harmful reproduction of dominant linguistic norms by NLP systems (a point to which we return in section 4), while Bender (2019) outlines a range of harms, including seeing stereotypes in search results and being made invisible to search engines due to language practices.

Techniques
Papers' techniques are not well grounded in the relevant literature outside of NLP. Perhaps unsurprisingly given that the papers' motivations are often vague, inconsistent, and lacking in normative reasoning, we also found that the papers' proposed quantitative techniques for measuring or mitigating "bias" do not effectively engage with the relevant literature outside of NLP. Papers on stereotyping are a notable exception: the Word Embedding Association Test (Caliskan et al., 2017) draws on the Implicit Association Test (Greenwald et al., 1998) from the social psychology literature, while several techniques operationalize the well-studied "Angry Black Woman" stereotype (Tan and Celis, 2019) and the "double bind" faced by women (Tan and Celis, 2019), in which women who succeed at stereotypically male tasks are perceived to be less likable than similarly successful men (Heilman et al., 2004). Tan and Celis (2019) also examine the compounding effects of race and gender, drawing on Black feminist scholarship on intersectionality (Crenshaw, 1989).
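To make the Word Embedding Association Test concrete, the sketch below computes its effect size as we understand the definition in Caliskan et al. (2017): each target word's association is its mean cosine similarity to one attribute set minus its mean cosine similarity to the other, and the effect size is the difference of the two target sets' mean associations divided by the standard deviation of the associations over all target words. The function names and the two-dimensional toy vectors are hypothetical, and the use of the sample standard deviation is an assumption of this sketch.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """s(w, A, B): mean similarity to attribute set A minus mean similarity to B."""
    return mean(cosine(w, a) for a in attrs_a) - mean(cosine(w, b) for b in attrs_b)


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations of X and Y, scaled by the spread over X and Y."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    # Assumption: sample standard deviation over all target words.
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Made-up 2-D "embeddings", chosen only so that X leans toward A and Y toward B.
toy_x = [(1.0, 0.1), (1.0, 0.2)]
toy_y = [(0.1, 1.0), (0.2, 1.0)]
toy_a = [(1.0, 0.0)]
toy_b = [(0.0, 1.0)]
effect = weat_effect_size(toy_x, toy_y, toy_a, toy_b)
swapped = weat_effect_size(toy_y, toy_x, toy_a, toy_b)
print(effect > 0, swapped < 0)
```

The statistic only says how strongly two sets of word vectors separate along a chosen attribute contrast; which word lists are chosen, and why the resulting association counts as harmful, remain normative decisions that the number itself does not settle.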
Papers' techniques are poorly matched to their motivations. We found that although 21% of the papers include allocational harms in their motivations, only four papers actually propose techniques for measuring or mitigating allocational harms.
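The mismatch can be checked directly from the Table 2 counts. The short sketch below is only an arithmetic illustration; the dictionary layout and the helper name `share_of_papers` are hypothetical, while the counts are those reported in Table 2.

```python
TOTAL_PAPERS = 146

# Table 2 counts as (motivation, technique) pairs.
table2 = {
    "Allocational harms": (30, 4),
    "Stereotyping": (50, 58),
    "Other representational harms": (52, 43),
    "Questionable correlations": (47, 42),
    "Vague/unstated": (23, 0),
    "Surveys, frameworks, and meta-analyses": (20, 20),
}


def share_of_papers(count, total=TOTAL_PAPERS):
    """Percentage of all surveyed papers, rounded to the nearest whole number."""
    return round(100 * count / total)


motivation_count, technique_count = table2["Allocational harms"]
allocational_motivation_share = share_of_papers(motivation_count)
allocational_technique_share = share_of_papers(technique_count)
total_motivations = sum(m for m, _ in table2.values())
print(allocational_motivation_share, allocational_technique_share, total_motivations)
```

The motivation column sums to more than 146 because a single paper can state several motivations, which is why shares are computed against the number of papers rather than against column totals.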
Papers focus on a narrow range of potential sources of "bias." We found that nearly all of the papers focus on system predictions as the potential sources of "bias," with many additionally focusing on "bias" in datasets (e.g., differences in the number of gendered pronouns in the training data). Most papers do not interrogate the normative implications of other decisions made during the development and deployment lifecycle, which is perhaps unsurprising given that their motivations sometimes include no normative reasoning. A few papers are exceptions, illustrating the impacts of task definitions, annotation guidelines, and evaluation metrics: Cao and Daumé (2019) study how folk conceptions of gender (Keyes, 2018) are reproduced in coreference resolution systems that assume a strict gender dichotomy, thereby maintaining cisnormativity; another paper focuses on the effect of priming annotators with information about possible dialectal differences when asking them to apply toxicity labels to sample tweets, finding that annotators who are primed are significantly less likely to label tweets containing features associated with African-American English as offensive.

A path forward
We now describe how researchers and practitioners conducting work analyzing "bias" in NLP systems might avoid the pitfalls presented in the previous section, sketching the beginnings of a path forward. We propose three recommendations that should guide such work, and, for each, provide several concrete research questions. We emphasize that these questions are not comprehensive, and are intended to generate further questions and lines of engagement. Our three recommendations are as follows:
(R1) Ground work analyzing "bias" in NLP systems in the relevant literature outside of NLP that explores the relationships between language and social hierarchies. Treat representational harms as harmful in their own right.
(R2) Provide explicit statements of why the system behaviors that are described as "bias" are harmful, in what ways, and to whom. Be forthright about the normative reasoning (Green, 2019) underlying these statements.
(R3) Examine language use in practice by engaging with the lived experiences of members of communities affected by NLP systems. Interrogate and reimagine the power relations between technologists and such communities.

Language and social hierarchies
Turning first to (R1), we argue that work analyzing "bias" in NLP systems will paint a much fuller picture if it engages with the relevant literature outside of NLP that explores the relationships between language and social hierarchies. Many disciplines, including sociolinguistics, linguistic anthropology, sociology, and social psychology, study how language takes on social meaning and the role that language plays in maintaining social hierarchies. For example, language is the means through which social groups are labeled and one way that beliefs about social groups are transmitted (e.g., Maass, 1999; Beukeboom and Burgers, 2019). Group labels can serve as the basis of stereotypes and thus reinforce social inequalities: "[T]he label content functions to identify a given category of people, and thereby conveys category boundaries and a position in a hierarchical taxonomy" (Beukeboom and Burgers, 2019). Similarly, "controlling images," such as stereotypes of Black women, which are linguistically and visually transmitted through literature, news media, television, and so forth, provide "ideological justification" for their continued oppression (Collins, 2000, Chapter 4).
As a result, many groups have sought to bring about social changes through changes in language, disrupting patterns of oppression and marginalization via so-called "gender-fair" language (Sczesny et al., 2016; Menegatti and Rubini, 2017), language that is more inclusive to people with disabilities (ADA, 2018), and language that is less dehumanizing (e.g., abandoning the use of the term "illegal" in everyday discourse on immigration in the U.S. (Rosa, 2019)). The fact that group labels are so contested is evidence of how deeply intertwined language and social hierarchies are. Taking "gender-fair" language as an example, the hope is that reducing asymmetries in language about women and men will reduce asymmetries in their social standing. Meanwhile, struggles over language use often arise from dominant social groups' desire to "control both material and symbolic resources" (i.e., "the right to decide what words will mean and to control those meanings"), as was the case in some white speakers' insistence on using offensive place names against the objections of Indigenous speakers (Hill, 2008, Chapter 3).
Sociolinguists and linguistic anthropologists have also examined language attitudes and language ideologies, or people's metalinguistic beliefs about language: Which language varieties or practices are taken as standard, ordinary, or unmarked? Which are considered correct, prestigious, or appropriate for public use, and which are considered incorrect, uneducated, or offensive (e.g., Campbell-Kibler, 2009; Preston, 2009; Loudermilk, 2015; Lanehart and Malik, 2018)? Which are rendered invisible (Roche, 2019)? Language ideologies play a vital role in reinforcing and justifying social hierarchies because beliefs about language varieties or practices often translate into beliefs about their speakers (e.g., Alim et al., 2016; Rosa and Flores, 2017; Craft et al., 2020). For example, in the U.S., the portrayal of non-white speakers' language varieties and practices as linguistically deficient helped to justify violent European colonialism, and today continues to justify enduring racial hierarchies by maintaining views of non-white speakers as lacking the language "required for complex thinking processes and successful engagement in the global economy" (Rosa and Flores, 2017).
Recognizing the role that language plays in maintaining social hierarchies is critical to the future of work analyzing "bias" in NLP systems. First, it helps to explain why representational harms are harmful in their own right. Second, the complexity of the relationships between language and social hierarchies illustrates why studying "bias" in NLP systems is so challenging, suggesting that researchers and practitioners will need to move beyond existing algorithmic fairness techniques. We argue that work must be grounded in the relevant literature outside of NLP that examines the relationships between language and social hierarchies; without this grounding, researchers and practitioners risk measuring or mitigating only what is convenient to measure or mitigate, rather than what is most normatively concerning.
More specifically, we recommend that work analyzing "bias" in NLP systems be reoriented around the following question: How are social hierarchies, language ideologies, and NLP systems coproduced? This question mirrors Benjamin's (2020) call to examine how "race and technology are coproduced" (i.e., how racial hierarchies, and the ideologies and discourses that maintain them, create and are re-created by technology). We recommend that researchers and practitioners similarly ask how existing social hierarchies and language ideologies drive the development and deployment of NLP systems, and how these systems therefore reproduce these hierarchies and ideologies. As a starting point for reorienting work analyzing "bias" in NLP systems around this question, we provide the following concrete research questions:
- [...] (Olteanu et al., 2017)? Are any non-quantitative evaluations performed?
- How do NLP systems reproduce or transform language ideologies? Which language varieties or practices come to be deemed good or bad? Might "good" language simply mean language that is easily handled by existing NLP systems? For example, linguistic phenomena arising from many language practices (Eisenstein, 2013) are described as "noisy text" and often viewed as a target for "normalization." How do the language ideologies that are reproduced by NLP systems maintain social hierarchies?
- Which representational harms are being measured or mitigated? Are these the most normatively concerning harms, or merely those that are well handled by existing algorithmic fairness techniques? Are there other representational harms that might be analyzed?

Conceptualizations of "bias"
Turning now to (R2), we argue that work analyzing "bias" in NLP systems should provide explicit statements of why the system behaviors that are described as "bias" are harmful, in what ways, and to whom, as well as the normative reasoning underlying these statements. In other words, researchers and practitioners should articulate their conceptualizations of "bias." As we described above, papers often contain descriptions of system behaviors that are understood to be self-evident statements of "bias." This use of imprecise terminology has led to papers all claiming to analyze "bias" in NLP systems, sometimes even in systems developed for the same task, but with different or even inconsistent conceptualizations of "bias," and no explanations for these differences. Yet analyzing "bias" is an inherently normative process, in which some system behaviors are deemed good and others harmful, even if assumptions about what kinds of system behaviors are harmful, in what ways, for whom, and why are not stated. We therefore echo calls by Bardzell and Bardzell (2011), Keyes et al. (2019), and Green (2019) for researchers and practitioners to make their normative reasoning explicit by articulating the social values that underpin their decisions to deem some system behaviors as harmful, no matter how obvious such values appear to be. We further argue that this reasoning should take into account the relationships between language and social hierarchies that we described above. First, these relationships provide a foundation from which to approach the normative reasoning that we recommend making explicit. For example, some system behaviors might be harmful precisely because they maintain social hierarchies.
Second, if work analyzing "bias" in NLP systems is reoriented to understand how social hierarchies, language ideologies, and NLP systems are coproduced, then this work will be incomplete if we fail to account for the ways that social hierarchies and language ideologies determine what we mean by "bias" in the first place. As a starting point, we therefore provide the following concrete research questions:
- What kinds of system behaviors are described as "bias"? What are their potential sources (e.g., general assumptions, task definition, data)?
- In what ways are these system behaviors harmful, to whom are they harmful, and why?
- What are the social values (obvious or not) that underpin this conceptualization of "bias"?

Language use in practice
Finally, we turn to (R3). Our perspective, which rests on a greater recognition of the relationships between language and social hierarchies, suggests several directions for examining language use in practice. Here, we focus on two. First, because language is necessarily situated, and because different social groups (particularly groups at the intersections of multiple axes of oppression) have different lived experiences due to their different social positions (Hanna et al., 2020), we recommend that researchers and practitioners center work analyzing "bias" in NLP systems around the lived experiences of members of communities affected by these systems. Second, we recommend that the power relations between technologists and such communities be interrogated and reimagined. Researchers have pointed out that algorithmic fairness techniques, by proposing incremental technical mitigations (e.g., collecting new datasets or training better models), maintain these power relations by (a) assuming that automated systems should continue to exist, rather than asking whether they should be built at all, and [...]
There are many disciplines for researchers and practitioners to draw on when pursuing these directions. For example, in human-computer interaction, Hamidi et al. (2018) study transgender people's experiences with automated gender recognition systems in order to uncover how these systems reproduce structures of transgender exclusion by redefining what it means to perform gender "normally." Value-sensitive design provides a framework for accounting for the values of different stakeholders in the design of technology (e.g., Friedman et al., 2006; Friedman and Hendry, 2019; Le Dantec et al., 2009; Yoo et al., 2019), while participatory design seeks to involve stakeholders in the design process itself (Sanders, 2002; Muller, 2007; Simonsen and Robertson, 2013; DiSalvo et al., 2013).
Participatory action research in education (Kemmis, 2006) and in language documentation and reclamation (Junker, 2018) is also relevant. In particular, work on language reclamation to support decolonization and tribal sovereignty (Leonard, 2012) and work in sociolinguistics focusing on developing co-equal research relationships with community members and supporting linguistic justice efforts (e.g., Bucholtz et al., 2014, 2016, 2019) provide examples of more emancipatory relationships with communities. Finally, several workshops and events have begun to explore how to empower stakeholders in the development and deployment of technology (Vaccaro et al., 2019; Givens and Morris, 2020; Sassaman et al., 2020) and how to help researchers and practitioners consider when not to build systems at all (Barocas et al., 2020).
As a starting point for engaging with communities affected by NLP systems, we therefore provide the following concrete research questions:
- How do communities become aware of NLP systems? Do they resist them, and if so, how?
- What additional costs are borne by communities for whom NLP systems do not work well?
- Do NLP systems shift power toward oppressive institutions (e.g., by enabling predictions that communities do not want made, linguistically based unfair allocation of resources or opportunities (Rosa and Flores, 2017), surveillance, or censorship), or away from such institutions?
- Who is involved in the development and deployment of NLP systems? How do decision-making processes maintain power relations between technologists and communities affected by NLP systems? Can these processes be changed to reimagine these relations?

Case study
To illustrate our recommendations, we present a case study covering work on African-American English (AAE). 5 Work analyzing "bias" in the context of AAE has shown that part-of-speech taggers, language identifcation systems, and dependency parsers all work less well on text containing features associated with AAE than on text without these features (Jørgensen et al., , 2016, and that toxicity detection systems score tweets containing features associated with AAE as more offensive than tweets without them . These papers have been critical for highlighting AAE as a language variety for which existing NLP systems may not work, illustrating their limitations. However, they do not conceptualize "racial bias" in the same way. The frst four of these papers simply focus on system performance differences between text containing features associated with AAE and text without these features. In contrast, the last two papers also focus on such system performance differences, but motivate this focus with the following additional reasoning: If tweets containing features associated with AAE are scored as more offensive than tweets without these features, then this might (a) yield negative perceptions of AAE; (b) result in disproportionate removal of tweets containing these features, impeding participation in online platforms and reducing the space available online in which speakers can use AAE freely; and (c) cause AAE speakers to incur additional costs if they have to change their language practices to avoid negative perceptions or tweet removal.
More importantly, none of these papers engage with the literature on AAE, racial hierarchies in the U.S., and raciolinguistic ideologies. By failing to engage with this literature-thereby treating AAE simply as one of many non-Penn Treebank varieties of English or perhaps as another challenging domain-work analyzing "bias" in NLP systems in the context of AAE fails to situate these systems in the world. Who are the speakers of AAE? How are they viewed? We argue that AAE as a language variety cannot be separated from its speakersprimarily Black people in the U.S., who experience systemic anti-Black racism-and the language ideologies that reinforce and justify racial hierarchies.
Even after decades of sociolinguistic efforts to legitimize AAE, it continues to be viewed as "bad" English and its speakers continue to be viewed as linguistically inadequate, a view called the deficit perspective (Alim et al., 2016; Rosa and Flores, 2017). This perspective persists despite demonstrations that AAE is rule-bound and grammatical (Mufwene et al., 1998; Green, 2002), in addition to ample evidence of its speakers' linguistic adroitness (e.g., Alim, 2004; Rickford and King, 2016). This perspective belongs to a broader set of raciolinguistic ideologies (Rosa and Flores, 2017), which also produce allocational harms; speakers of AAE are frequently penalized for not adhering to dominant language practices, including in the education system (Alim, 2004; Terry et al., 2010), when seeking housing (Baugh, 2018), and in the judicial system, where their testimony is misunderstood or, worse yet, disbelieved (Rickford and King, 2016; Jones et al., 2019). These raciolinguistic ideologies position racialized communities as needing linguistic intervention, such as language education programs, in which these and other harms can be reduced if communities accommodate to dominant language practices (Rosa and Flores, 2017).
In the technology industry, speakers of AAE are often not considered consumers who matter. For example, Benjamin (2019) recounts an Apple employee who worked on speech recognition for Siri: "As they worked on different English dialects - Australian, Singaporean, and Indian English - [the employee] asked his boss: 'What about African American English?' To this his boss responded: 'Well, Apple products are for the premium market.'" The reality, of course, is that speakers of AAE tend not to represent the "premium market" precisely because of institutions and policies that help to maintain racial hierarchies by systematically denying them the opportunities to develop wealth that are available to white Americans (Rothstein, 2017), an exclusion that is reproduced in technology by countless decisions like the one described above.
Engaging with the literature outlined above situates the system behaviors that are described as "bias," providing a foundation for normative reasoning. Researchers and practitioners should be concerned about "racial bias" in toxicity detection systems not only because performance differences impair system performance, but because they reproduce longstanding injustices of stigmatization and disenfranchisement for speakers of AAE. In re-stigmatizing AAE, they reproduce language ideologies in which AAE is viewed as ungrammatical, uneducated, and offensive. These ideologies, in turn, enable linguistic discrimination and justify enduring racial hierarchies (Rosa and Flores, 2017). Our perspective, which understands racial hierarchies and raciolinguistic ideologies as structural conditions that govern the development and deployment of technology, implies that techniques for measuring or mitigating "bias" in NLP systems will necessarily be incomplete unless they interrogate and dismantle these structural conditions, including the power relations between technologists and racialized communities.
We emphasize that engaging with the literature on AAE, racial hierarchies in the U.S., and raciolinguistic ideologies can generate new lines of engagement. These lines include work on the ways that the decisions made during the development and deployment of NLP systems produce stigmatization and disenfranchisement, and work on AAE use in practice, such as the ways that speakers of AAE interact with NLP systems that were not designed for them. This literature can also help researchers and practitioners address the allocational harms that may be produced by NLP systems, and ensure that even well-intentioned NLP systems do not position racialized communities as needing linguistic intervention or accommodation to dominant language practices. Finally, researchers and practitioners wishing to design better systems can also draw on a growing body of work on anti-racist language pedagogy that challenges the deficit perspective of AAE and other racialized language practices (e.g., Flores and Chaparro, 2018; Baker-Bell, 2019; Martínez and Mejía, 2019), as well as the work that we described in section 4.3 on reimagining the power relations between technologists and communities affected by technology.

Conclusion
By surveying 146 papers analyzing "bias" in NLP systems, we found that (a) their motivations are often vague, inconsistent, and lacking in normative reasoning; and (b) their proposed quantitative techniques for measuring or mitigating "bias" are poorly matched to their motivations and do not engage with the relevant literature outside of NLP. To help researchers and practitioners avoid these pitfalls, we proposed three recommendations that should guide work analyzing "bias" in NLP systems, and, for each, provided several concrete research questions. These recommendations rest on a greater recognition of the relationships between language and social hierarchies, a step that we see as paramount to establishing a path forward.

Machine translation; Type-level embeddings; Type-level and contextualized embeddings; Dialogue generation

"Existing biases in data can be amplified by models and the resulting output consumed by the public can influence them, encourage and reinforce harmful stereotypes, or distort the truth. Automated systems that depend on these models can take problematic actions based on biased profiling of individuals."

References
"Other biases can be inappropriate and result in negative experiences for some groups of people. Examples include, loan eligibility and crime recidivism prediction systems...and resumé sorting systems that believe that men are more qualified to be programmers than women (Bolukbasi et al., 2016). Similarly, sentiment and emotion analysis systems can also perpetuate and accentuate inappropriate human biases, e.g., systems that consider utterances from one race or gender to be less positive simply because of their race or gender, or customer support systems that prioritize a call from an angry male over a call from the equally angry female."

"[MT training] may incur an association of gender-specified pronouns (in the target) and gender-neutral ones (in the source) for lexicon pairs that frequently collocate in the corpora. We claim that this kind of phenomenon seriously threatens the fairness of a translation system, in the sense that it lacks generality and inserts social bias to the inference. Moreover, the input is not fully correct (considering gender-neutrality) and might offend the users who expect fairer representations."

"Learned models exhibit social bias when their training data encode stereotypes not relevant for the task, but the correlations are picked up anyway."

"However, embeddings trained on human-generated corpora have been demonstrated to inherit strong gender stereotypes that reflect social constructs....Such a bias substantially affects downstream applications....This concerns the practitioners who use the embedding model to build gender-sensitive applications such as a resume filtering system or a job recommendation system as the automated system may discriminate candidates based on their gender, as reflected by their name. Besides, biased embeddings may implicitly affect downstream applications used in our daily lives. For example, when searching for 'computer scientist' using a search engine...a search algorithm using an embedding model in the backbone tends to rank male scientists higher than females' [sic], hindering women from being recognized and further exacerbating the gender inequality in the community."

"[P]rominent word embeddings such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) encode systematic biases against women and black people (Bolukbasi et al., 2016; Garg et al., 2018), implicating many NLP systems in scaling up social injustice."

"Since the goal of dialogue systems is to talk with users...if the systems show discriminatory behaviors in the interactions, the user experience will be adversely affected. Moreover, public commercial chatbots can get resisted for their improper speech."

Table 3: Examples of the categories into which the papers' motivations and proposed quantitative techniques for measuring or mitigating "bias" fall. Bold text in the quotes denotes the content that yields our categorizations.