Toward Gender-Inclusive Coreference Resolution

Correctly resolving textual mentions of people fundamentally entails making inferences about those people. Such inferences raise the risk of systemic biases in coreference resolution systems, including biases that can harm binary and non-binary trans and cis stakeholders. To better understand such biases, we foreground nuanced conceptualizations of gender from sociology and sociolinguistics, and develop two new datasets for interrogating bias in crowd annotations and in existing coreference resolution systems. Through these studies, conducted on English text, we confirm that without acknowledging the complexity of gender and building systems that recognize it, we build systems that lead to many potential harms.


Introduction
Coreference resolution, the task of determining which textual references resolve to the same real-world entity, requires making inferences about those entities. Especially when those entities are people, coreference resolution systems run the risk of making unlicensed inferences, possibly resulting in harms either to individuals or to groups of people. Embedded in coreference inferences are varied aspects of gender, both because gender can show up explicitly (e.g., pronouns in English, morphology in Arabic) and because societal expectations and stereotypes around gender roles may be explicitly or implicitly assumed by speakers or listeners. This can lead to significant biases in coreference resolution systems: cases where systems "systematically and unfairly discriminate against certain individuals or groups of individuals in favor of others" (Friedman and Nissenbaum, 1996, p. 332).
Gender bias in coreference resolution can manifest in many ways; work by Rudinger et al. (2018), Zhao et al. (2018a), and Webster et al. (2018) focused largely on the case of binary gender discrimination in trained coreference systems, showing that current systems over-rely on social stereotypes when resolving HE and SHE pronouns 1 (see §2). Contemporaneously, critical work in Human-Computer Interaction has complicated discussions around gender in other fields, such as computer vision (Keyes, 2018; Hamidi et al., 2018).
Building on both lines of work, and inspired by Keyes's (2018) study of vision-based automatic gender recognition systems, we consider gender bias from a broader conceptual frame than the binary "folk" model. We investigate ways in which folk notions of gender, namely that there are two genders, that gender is assigned at birth and immutable, and that gender is in perfect correspondence with gendered linguistic forms, lead to the development of technology that is exclusionary of, and harmful to, binary and non-binary trans and cis people. 2 Addressing such issues is critical not just to improve the quality of our systems, but more pointedly to minimize the harms our systems can cause by reinforcing existing unjust social hierarchies (Lambert and Packer, 2019).
There are several stakeholder groups who may easily face harms when coreference systems are used (Blodgett et al., 2020). These harms include both allocational and representational harms (Barocas et al., 2017), such as quality-of-service, erasure, and stereotyping harms. Following Bender's (2019) taxonomy of stakeholders and Barocas et al.'s (2017) taxonomy of harms, there are several ways in which trans-exclusionary coreference resolution systems can cause harm:

Indirect: subject of query. If a person is the subject of a web query, pages about xem may be missed if "multiple mentions of query" is a ranking feature and the system cannot resolve xyr pronouns ⇒ quality of service, erasure.

Direct: by choice. If a grammar checker uses coreference, it may insist that an author writing hir third-person autobiography is repeatedly making errors when referring to hirself ⇒ quality of service, stereotyping, denigration.

Direct: not by choice. If an information extraction system run on résumés relies on cisnormative assumptions, job experiences of a candidate who has transitioned and changed his pronouns may be missed ⇒ allocative harm, erasure.

Many stakeholders. If a machine translation system uses discourse context to generate pronouns, then errors can result in directly misgendering subjects of the document being translated ⇒ quality of service, denigration, erasure.

1 Throughout, we avoid mapping pronouns to a "gender" label, preferring to use the pronoun directly, including (in English) SHE, HE, the non-binary use of singular THEY, and neopronouns (e.g., ZE/HIR, XEY/XEM), which have been in use since at least the 1970s (Bustillos, 2011; Merriam-Webster, 2016; Bradley et al., 2019; Hord, 2016; Spivak, 1997).

2 Following GLAAD (2007), transgender individuals are those whose gender differs from the sex they were assigned at birth. This is in opposition to cisgender individuals, whose sex assigned at birth happens to correspond to their gender. Transgender individuals can be either binary (those whose gender falls within the "male/female" dichotomy) or non-binary (those for whom the relationship is more complex).
To address such harms, as well as to understand where and how they arise, we need to complicate (a) what "gender" means and (b) how harms can enter into natural language processing (NLP) systems. Toward (a), we begin with a unifying analysis (§3) of how gender is socially constructed, and how social conditions in the world impose expectations around people's gender. Of particular interest is how gender is reflected in language, and how that both matches and potentially mismatches the way people experience their gender in the world. Then, in order to understand social biases around gender, we find it necessary to consider the different ways in which gender can be realized linguistically, breaking down what have previously been considered "gendered words" in NLP papers into finer-grained categories identified in the sociolinguistics literature: lexical, referential, grammatical, and social gender.
Toward (b), we focus on how bias can enter into two stages of machine learning systems: data annotation (§4) and model definition (§5). We construct two new datasets: (1) MAP (a dataset similar to GAP (Webster et al., 2018) but without binary gender constraints), on which we can perform counterfactual manipulations, and (2) GICoref (a fully annotated coreference resolution dataset written by and about trans people). 3 In all cases, we focus largely on harms due to over- and under-representation (Kay et al., 2015), replicating stereotypes (Sweeney, 2013; Caliskan et al., 2017) (particularly those that are cisnormative and/or heteronormative), and quality-of-service differentials (Buolamwini and Gebru, 2018).
The primary contributions of this paper are: (1) connecting existing work on gender bias in NLP to sociological and sociolinguistic conceptions of gender, to provide a scaffolding for future work on analyzing "gender bias in NLP" (§3); (2) developing an ablation technique for measuring gender bias in coreference resolution annotations, focusing on the human bias that can enter into annotation tasks (§4); and (3) constructing a new dataset, the Gender Inclusive Coreference dataset (GICOREF), for testing performance of coreference resolution systems on texts that discuss non-binary and binary transgender people (§5).

Related Work
There are four recent papers that consider gender bias in coreference resolution systems. Rudinger et al. (2018) evaluate coreference systems for evidence of occupational stereotyping by constructing Winograd-esque (Levesque et al., 2012) test examples. They find that humans can reliably resolve these examples, but systems largely fail at them, typically in a gender-stereotypical way. In contemporaneous work, Zhao et al. (2018a) proposed a very similar, also Winograd-esque, scheme for measuring gender-based occupational stereotypes. In addition to reaching conclusions similar to Rudinger et al.'s (2018), that work also used a "counterfactual" data process similar to the one we use in §4.1, in order to provide additional training data to a coreference resolution system. Webster et al. (2018) produced the GAP dataset for evaluating coreference systems, specifically seeking examples where "gender" (left underspecified) could not be used to help coreference. They found that coreference systems struggle in these cases, also pointing to the fact that some of the success of current coreference systems is due to reliance on (binary) gender stereotypes. Finally, Ackerman (2019) presents a breakdown of gender alternative to the one we use (§3), and proposes matching criteria for modeling coreference resolution linguistically, taking a trans-inclusive perspective on gender.

Linguistic & Social Gender
The concept of gender is complex and contested, covering (at least) aspects of a person's internal experience, how they express this to the world, how social conditions in the world impose expectations on them (including expectations around their sexuality), and how they are perceived and accepted (or not). When this complex concept is realized in language, the situation becomes even more complex: linguistic categories of gender do not even remotely map one-to-one to social categories. As observed by Bucholtz (1999): "Attempts to read linguistic structure directly for information about social gender are often misguided." For instance, when working in a language like English which formally marks gender on pronouns, it is all too easy to equate "recognizing the pronoun that corefers with this name" with "recognizing the real-world gender of referent of that name." Furthermore, despite the impossibility of a perfect alignment with linguistic gender, it is generally clear that an incorrectly gendered reference to a person (whether through pronominalization or otherwise) can be highly problematic (Johnson et al., 2019;McLemore, 2015). This process of misgendering is problematic for both trans and cis individuals to the extent that transgender historian Stryker (2008) writes: "[o]ne's gender identity could perhaps best be described as how one feels about being referred to by a particular pronoun."

Sociological Gender
Many modern trans-inclusive models of gender recognize that gender encompasses many different aspects. These aspects include the experience that one has of gender (or the lack thereof), the way that one expresses one's gender to the world, and the way that normative social conditions impose gender norms, typically as a dichotomy between masculine and feminine roles or traits (Kramarae and Treichler, 1985; West and Zimmerman, 1987; Butler, 1990; Risman, 2009; Serano, 2007). Gender self-determination, on the other hand, holds that each person is the "ultimate authority" on their own gender identity (Zimman, 2019; Stanley, 2014), with Zimman (2019) further arguing for the importance of the role language plays in that determination. Such trans-inclusive models deconflate anatomical and biological traits, and the sex a person was assigned at birth, from one's gendered position in society; this includes intersex people, whose anatomical/biological factors do not match the usual designational criteria for either sex. Trans-inclusive views typically recognize that gender exists beyond the regressive "female"/"male" binary 4 ; additionally, one's gender may shift over time or by context (often "genderfluid"), and some people do not experience gender at all (often "agender") (Kessler and McKenna, 1978; Schilt and Westbrook, 2009; Darwin, 2017; Richards et al., 2017). In §5 we analyze the degree to which NLP papers make trans-inclusive or trans-exclusive assumptions.
Social gender refers to the imposition of gender roles or traits based on normative social conditions (Kramarae and Treichler, 1985), which often includes imposing a dichotomy between feminine and masculine (in behavior, dress, speech, occupation, societal roles, etc.). Ackerman (2019) highlights a highly overlapping concept, "bio-social gender," which consists of gender role, gender expression, and gender identity. Taking gender role as an example, upon learning that a nurse is coming to their hospital room, a patient may form expectations that this person is likely to be "female," and may generate expectations around how their face or body may look, how they are likely to be dressed, how and where hair may appear, how to refer to them, and so on. This process, often referred to as gendering (Serano, 2007), occurs both in real-world interactions and in purely linguistic settings (e.g., reading a newspaper), in which readers may use social gender cues to assign gender(s) to the real-world people being discussed.
Grammatical gender, similarly defined in Ackerman (2019), is nothing more than a classification of nouns based on a principle of grammatical agreement. In "gender languages" there are typically two or three grammatical genders that have, for animate or personal references, considerable correspondence between a FEM (resp. MASC) grammatical gender and referents with female-(resp. male-) 5 social gender. In comparison, "noun class languages" have no such correspondence, and typically many more classes. Some languages have no grammatical gender at all; English is generally seen as one (Nissen, 2002;Baron, 1971) (though this is contested (Bjorkman, 2017)).
Referential gender (similar, but not identical, to Ackerman's (2019) "conceptual gender") relates linguistic expressions to extra-linguistic reality, typically identifying referents as "female," "male," or "gender-indefinite." Fundamentally, referential gender only exists when there is an entity being referred to, and their gender (or sex) is realized linguistically. The most obvious examples in English are gendered third-person pronouns (SHE, HE), including neopronouns (ZE, EM) and singular THEY 6 , but referential gender also includes cases like "policeman" when the intended referent of this noun has social gender "male" (though not when "policeman" is used non-referentially, as in "every policeman needs to hold others accountable").
Lexical gender refers to extra-linguistic properties of female-ness or male-ness in a non-referential way, as in terms like "mother" as well as gendered terms of address like "Mrs." Importantly, lexical gender is a property of the linguistic unit, not of its referent in the real world, which may or may not exist. For instance, in "Every son loves his parents," there is no real-world referent of "son" (and therefore no referential gender), yet it still (likely) takes HIS as a pronominal anaphor because "son" has lexical gender MASC.

Social and Linguistic Gender Interplays
The relationship between these aspects of gender is complex, and none is one-to-one. The referential gender of an individual (e.g., pronouns in English) may or may not match their social gender, and this may change by context. This can happen in the case of people whose everyday experience of their gender fluctuates over time (at any interval), as well as in the case of drag performers (e.g., some men who perform drag are addressed as SHE while performing, and HE when not (National Center for Transgender Equality, 2017)). The other linguistic forms of gender (grammatical, lexical) also need not match each other, nor match referential gender (Hellinger and Motschenbacher, 2015).
Social gender (societal expectations, in particular) captures the observation that upon hearing "My cousin is a librarian", many speakers will infer "female" for "cousin", because of either an entailment of "librarian" or some sort of probabilistic inference (Lyons, 1977), but not based on either grammatical gender (which does not exist in English) or lexical gender. We focus on English, which has no grammatical gender, but does have lexical gender. English also marks referential gender on singular third person pronouns.
Below, we use this more nuanced notion of different types of gender to inspect how bias plays out in coreference resolution systems. These biases may arise in the context of any of these notions of gender, and we encourage future work to take care with, and be explicit about, which notions of gender are being used and when.

Bias in Human Annotation
A possible source of bias in coreference systems comes from the human annotations on the data used to train them. Such biases can arise from a combination of (possibly) underspecified annotation guidelines and the positionality of the annotators themselves. In this section, we study how different aspects of linguistic gender impact an annotator's judgments of anaphora. This parallels Ackerman's (2019) linguistic analysis, which proposes a Broad Matching Criterion positing that "matching gender requires at least one level of the mental representation of gender to be identical to the candidate antecedent in order to match." Our study can be seen as evaluating which conceptual properties of gender are most salient in human judgments. We start with natural text in which we can cast the coreference task as a binary classification problem ("which of these two names does this pronoun refer to?"), inspired by Webster et al. (2018). We then generate "counterfactual augmentations" of this dataset by ablating the various notions of linguistic gender described in §3.2, similar to Zmigrod et al. (2019). We finally evaluate the impact of these ablations on human annotation behavior to answer the question: which forms of linguistic knowledge are most essential for human annotators to make consistent judgments? See Appendix A for examples of how linguistic gender may be used to infer social gender.

Ablation Methodology
In order to determine which cues annotators are using, and the degree to which they use them, we construct an ablation study in which we hide various aspects of gender and evaluate how this impacts annotators' judgments of anaphoricity. We construct binary classification examples taken from Wikipedia pages, in which a single pronoun is selected and two possible antecedent names are given, and the annotator must select which one the pronoun refers to. We cannot use Webster et al.'s GAP dataset directly, because their data is constrained so that the "gender" of the two possible antecedents is "the same"; 7 we, in contrast, are specifically interested in how annotators make decisions even when additional gender information is available. Thus, we construct a dataset called Maybe Ambiguous Pronoun (MAP) following Webster et al.'s approach, but without restricting the two names to match in gender.

7 It is unclear from the GAP dataset what notion of "gender" is used, nor how it was determined to be "the same."
In ablating gender information, one challenge is that removing social gender cues (e.g., "nurse" tending to be read as female) is not possible, because such cues can exist anywhere. Likewise, it is not possible to remove syntactic cues in a non-circular manner. For example, in (1), the syntactic structure strongly suggests that the antecedent of "herself" is "Liang," making it less likely that a later "He" corefers with Liang (though it is possible, and such cases exist in natural data due either to genderfluidity or to misgendering).
Fortunately, it is possible to enumerate a high-coverage list of English terms that signal lexical gender: terms of address (Mrs., Mr.) and semantically gendered nouns (mother). 8 We assembled such a list by taking many online lists (mostly targeted at English language learners), merging them, and manually filtering the result. The assembly process and the final list are published with the MAP dataset and its datasheet.
To execute the "hiding" of various aspects of gender, we use the following substitutions: (a) ¬PRO: replace third-person pronouns with gender-neutral variants (THEY, XEY, ZE). (b) ¬NAME: replace names with random names, reduced to a first initial and last name. (c) ¬SEM: replace semantically gendered nouns with gender-indefinite variants. (d) ¬ADDR: remove terms of address. 9 See Figure 1 for an example of all substitutions.
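As an illustration of the substitutions above, here is a minimal sketch (our own, not the released MAP preprocessing code; the word lists are tiny stand-ins for the full lexical-gender list published with the dataset):

```python
import re

# Sketch of the ¬PRO, ¬SEM, and ¬ADDR ablations. These mappings are
# illustrative stand-ins only. A real implementation would also need POS
# disambiguation (e.g., "her" as object vs. possessive) and the ¬NAME
# substitution (random names reduced to a first initial plus last name),
# both omitted here.
NEUTRAL_PRONOUNS = {"he": "they", "she": "they", "his": "their",
                    "him": "them", "her": "their",
                    "herself": "themself", "himself": "themself"}
SEM_GENDERED = {"mother": "parent", "father": "parent",
                "daughter": "child", "son": "child", "actress": "actor"}
ADDRESS_TERMS = {"Mr.", "Mrs.", "Ms.", "Miss"}

def _sub_words(text, mapping):
    """Replace whole words per `mapping`, preserving capitalization."""
    def repl(match):
        word = match.group(0)
        out = mapping.get(word.lower(), word)
        return out.capitalize() if word[0].isupper() else out
    return re.sub(r"[A-Za-z]+", repl, text)

def ablate(text, pro=False, sem=False, addr=False):
    """Apply the requested gender-cue ablations to `text`."""
    if pro:   # ¬PRO: gender-neutral pronouns
        text = _sub_words(text, NEUTRAL_PRONOUNS)
    if sem:   # ¬SEM: gender-indefinite nouns
        text = _sub_words(text, SEM_GENDERED)
    if addr:  # ¬ADDR: drop terms of address
        text = " ".join(w for w in text.split() if w not in ADDRESS_TERMS)
    return text

print(ablate("She told her mother.", pro=True, sem=True))
# -> They told their parent.
```

Applying all four substitutions yields the ZERO condition described below; applying subsets yields the other conditions.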
We perform two sets of experiments, one following a "forward selection" type of ablation (start with everything removed and add each cue back in one at a time) and one following "backward selection" (remove each cue separately). Forward selection is necessary in order to de-conflate syntactic cues from stereotypes, while backward selection gives a sense of how much impact each type of gender cue has in the context of all the others.
We begin with ZERO, in which we apply all four substitutions. Since this also removes gender cues from the pronouns themselves, an annotator cannot substantially rely on social gender to perform these resolutions. We next consider adding back in the original pronouns (always HE or SHE here), yielding ¬NAME ¬SEM ¬ADDR. Any difference in annotation behavior between ZERO and ¬NAME ¬SEM ¬ADDR can only be due to social gender stereotypes. The next setting, ¬SEM ¬ADDR removes both forms of lexical gender (semantically gendered nouns and terms of address); differences between ¬SEM ¬ADDR and ¬NAME ¬SEM ¬ADDR show how much names are relied on for annotation. Similarly, ¬NAME ¬ADDR removes names and terms of address, showing the impact of semantically gendered nouns, and ¬NAME ¬SEM removes names and semantically gendered nouns, showing the impact of terms of address.
In the backward selection case, we begin with ORIG, which is the unmodified original text. To this, we can apply the pronoun filter to get ¬PRO; differences in annotation between ORIG and ¬PRO give a measure of how much any sort of gender-based inference is used. Similarly, we get ¬NAME by only removing names, which gives a measure of how much names are used (in the context of all other cues); we get ¬SEM by only removing semantically gendered words; and ¬ADDR by only removing terms of address.
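Concretely, the full grid of forward- and backward-selection conditions can be enumerated as sets of ablated cue types (the bookkeeping below is our own; condition names mirror the paper's notation):

```python
# Enumerate the experimental conditions as sets of ablated gender cues.
CUES = ["PRO", "NAME", "SEM", "ADDR"]

# Forward selection: start from ZERO (all cues ablated) and add each
# cue back in, one at a time.
forward = {"ZERO": set(CUES)}
for kept in CUES:
    name = " ".join("¬" + c for c in CUES if c != kept)
    forward[name] = set(CUES) - {kept}

# Backward selection: start from ORIG (nothing ablated) and ablate each
# cue separately.
backward = {"ORIG": set()}
for cue in CUES:
    backward["¬" + cue] = {cue}

print(sorted(forward))  # the five forward-selection condition names
```

Comparing annotation accuracy between any two adjacent conditions then isolates the contribution of the single cue that differs between them.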

Annotation Results
We construct examples using the methodology defined above. We then conduct annotation experiments using crowdworkers on Amazon Mechanical Turk, following the methodology by which the original GAP corpus was created. 10 Because we wanted to also capture uncertainty, we ask the crowdworkers how sure they are of their choices: "definitely" sure, "probably" sure, or "unsure." Figure 2 shows the human annotation results as binary classification accuracy for resolving the pronoun to the antecedent. We can see that removing pronouns leads to a significant drop in accuracy. This indicates that gender-based inferences, especially social gender stereotypes, play the most significant role in annotators' judgments. Moreover, if we compare ORIG with the columns to its left, we see that names are another significant cue for annotator judgments, while lexical gender cues do not have significant impacts on human annotation accuracy. This is likely in part due to the low frequency of lexical gender cues in our dataset: every example has pronouns and names, whereas 49% of the examples have semantically gendered nouns and only 3% include terms of address. We also note that, comparing ¬NAME ¬SEM ¬ADDR to ¬SEM ¬ADDR and ¬NAME ¬ADDR, accuracy drops when these gender cues are removed; though the differences are not statistically significant, we did not expect this drop.
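The per-condition comparisons above can be sketched as a small aggregation routine (the field names and the numeric certainty mapping here are our own hypothetical choices, not the paper's):

```python
from collections import defaultdict

# Per-condition accuracy and mean certainty over crowdworker annotations.
# Mapping "definitely"/"probably"/"unsure" to 1.0/0.5/0.0 is an
# illustrative assumption.
CERTAINTY = {"definitely": 1.0, "probably": 0.5, "unsure": 0.0}

def summarize(annotations):
    """annotations: iterable of dicts with keys 'condition', 'choice',
    'gold', and 'certainty'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    cert_sum = defaultdict(float)
    for a in annotations:
        c = a["condition"]
        total[c] += 1
        correct[c] += int(a["choice"] == a["gold"])
        cert_sum[c] += CERTAINTY[a["certainty"]]
    return {c: {"accuracy": correct[c] / total[c],
                "mean_certainty": cert_sum[c] / total[c]}
            for c in total}
```

A routine like this produces, per ablation condition, the two quantities plotted against each other in the analysis: accuracy and the annotators' self-reported certainty.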
Finally, we find that annotators' certainty values follow the same trend as their accuracy: annotators have a reasonable sense of when they are unsure. We also note that accuracy scores are essentially the same for ZERO and ¬PRO, which suggests that once explicit binary gender is gone from pronouns, the impact of any other form of linguistic gender on annotators' decisions is also removed.

Bias in Model Specifications
In addition to biases that can arise from the data a system is trained on, as studied in the previous section, bias can also come from how models are structured. For instance, a system may fail to recognize anything other than a fixed dictionary of pronouns as possible referents to entities. Here, we analyze prior work on models for coreference resolution in three ways. First, we conduct a literature study to quantify how NLP papers discuss gender. Second, similar to Zhao et al. (2018a) and Rudinger et al. (2018), we evaluate five freely available systems on the ablated data from §4. Third, we evaluate these systems on the dataset we created: Gender Inclusive Coreference (GICOREF).

Cis-normativity in published NLP papers
In our first study, we adapt the approach Keyes (2018) took for analyzing the degree to which computer vision papers encoded trans-exclusive models of gender. In particular, we began with a random sample of ∼150 papers from the ACL Anthology that mention the word "gender" and coded them according to the following questions:

• Does the paper discuss coreference resolution?
• Does the paper study English?
• L.G: Does the paper deal with linguistic gender (grammatical gender or gendered pronouns)?
• S.G: Does the paper deal with social gender?
• L.G≠S.G: (If yes to L.G and S.G:) Does the paper distinguish linguistic from social gender?
• S.G Binary: (If yes to S.G:) Does the paper explicitly or implicitly assume that social gender is binary?
• S.G Immutable: (If yes to S.G:) Does the paper explicitly or implicitly assume social gender is immutable?
• They/Neo: (If yes to S.G and to English:) Does the paper explicitly consider uses of definite singular "they" or neopronouns?

The results of this coding are in Table 1 (the full annotation is in Appendix B). We see that, of the 22 coreference papers analyzed, the vast majority conform to a "folk" theory of language: only 5.5% distinguish social from linguistic gender (despite it being relevant); only 5.6% explicitly model gender as inclusive of non-binary identities; no papers treat gender as anything other than completely immutable; 11 and only 7.1% (one paper!) consider neopronouns and/or specific singular THEY.

Table 1: Coding results (all papers vs. coreference papers):
L.G≠S.G?        … (of 27)        5.5% (of 18)
S.G Binary?     92.8% (of 84)    94.4% (of 18)
S.G Immutable?  94.5% (of 74)    100.0% (of 14)
They/Neo?       3.5% (of 56)     7.1% (of 14)

The situation for papers not specifically about coreference is similar (the majority of these are either purely linguistic papers about grammatical gender in languages other than English, or papers that do "gender recognition" of authors based on their writing; May (2019) discusses the (re)production of gender in automated gender recognition in NLP in much more detail).

11 The most common ways in which papers implicitly assume that social gender is immutable are either (1) relying on external knowledge bases that map names to "gender," or (2) scraping a history of a user's social media posts or emails and assuming that their "gender" today matches the gender of that historical record.
Overall, the broader situation is equally troubling, and generally also fails to escape the folk theory of gender. None of the differences between coreference and non-coreference papers are significant at the p = 0.05 level except for the first two questions, due to the small sample size (according to an n − 1 chi-squared test). The result is that although we do not know exactly what decisions are baked into all systems, the vast majority in our study (including two papers by one of the authors (Daumé and Marcu, 2005; Orita et al., 2015)) come with strong gender binary assumptions, and exist within a broader sphere of literature that erases non-binary and binary trans identities.
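For reference, the "n − 1" chi-squared statistic for a 2×2 contingency table is simply the Pearson statistic with N replaced by N − 1, often recommended for small samples; a minimal sketch (ours, not the authors' analysis code):

```python
def n_minus_1_chi_squared(a, b, c, d):
    """'n - 1' chi-squared statistic for the 2x2 table [[a, b], [c, d]]:
    the Pearson chi-squared n*(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)),
    with n replaced by n - 1. Compare against the chi-squared critical
    value with one degree of freedom (3.841 at p = 0.05)."""
    n = a + b + c + d
    numerator = (a * d - b * c) ** 2 * (n - 1)
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# A perfectly balanced table shows no association:
print(n_minus_1_chi_squared(10, 10, 10, 10))  # -> 0.0
```

With counts on the order of a couple of dozen papers per cell, a fairly large difference in proportions is needed before this statistic exceeds the critical value, which is consistent with the non-significant differences reported above.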

System performance on MAP
Next, we analyze the effect that our different ablation mechanisms have on existing coreference resolution systems. In particular, we run five coreference resolution systems on our ablated data: the AI2 system (AI2; Gardner et al., 2017); Hugging Face (HF; Wolf, 2017), a neural system built on spaCy; and the Stanford deterministic (SfdD; Raghunathan et al., 2010), statistical (SfdS; Clark and Manning, 2015), and neural (SfdN; Clark and Manning, 2016) systems. Figure 3 shows the results. We can see that the system accuracies mostly follow the same pattern as the human accuracy scores, though all are significantly lower than the human results. Accuracy scores for systems drop dramatically when we ablate out referential gender in pronouns, revealing that these coreference resolution systems rely heavily on gender-based inferences. Among individual systems, HF and SfdN have similar results and outperform the other systems in most cases; SfdD accuracy drops significantly once names are ablated.
These results echo and extend previous observations made by Zhao et al. (2018a), who focus on detecting stereotypes within occupations. They detect gender bias by checking whether system accuracies are the same for cases that can be resolved by syntactic cues and cases that cannot, on both original data and gender-reversed data. Similarly, Rudinger et al. (2018) also focus on detecting occupational stereotypes. They construct a dataset without any gender cues other than stereotypes, and check how systems perform with different pronouns: THEY, SHE, HE. Ideally, systems should perform the same on all of these, because there are no gender cues in the sentence. However, they find that systems do not work on THEY and perform better on HE than on SHE. Our analysis breaks this stereotyping down further, to detect which aspects of gender signals are most leveraged by current systems.

System behavior on gender-inclusive data
Finally, in order to evaluate current coreference resolution models in gender-inclusive contexts, we introduce a new dataset, GICOREF. Here we focus on naturally occurring data, but sampled specifically to surface more gender-related phenomena than might be found in, say, the Wall Street Journal.
Our new GICOREF dataset consists of 95 documents from three types of sources: articles from English Wikipedia about people with non-binary gender identities, articles from LGBTQ periodicals, and fan-fiction stories from Archive of Our Own (with the respective authors' permission). 12 These documents were each annotated by both of the authors and adjudicated. 13 This data includes many examples of people who use pronouns other than SHE or HE (the dataset contains 27% HE, 20% SHE, 35% THEY, and 18% neopronouns), people who are genderfluid and whose names or pronouns change through the article, people who are misgendered, and people in relationships that are not heteronormative. In addition, incorrect references (misgendering and deadnaming 14 ) are explicitly annotated. 15 Two example annotated documents, one from Wikipedia and one from Archive of Our Own, are provided in Appendix C and Appendix D.
We run the same systems as before on this dataset. Table 2 reports results according to the standard coreference resolution evaluation metric, LEA (Moosavi and Strube, 2016). Since no systems are implemented to explicitly mark incorrect references, and no current evaluation metrics address this case, we perform the same evaluation twice: once with incorrect references included as regular references in the ground truth, and once with incorrect references excluded. Due to the limited number of incorrect references in the dataset, the difference between the two sets of results is not significant; here we report only the latter.

12 See https://archiveofourown.org; thanks to Os Keyes for this suggestion.

13 We evaluate inter-annotator agreement by treating one annotation as gold standard and the other as system output and computing the LEA metric; the resulting F1 score is 92%. During the adjudication process we found that most disagreements were due to one of the authors missing or overlooking mentions, and rarely due to true "disagreement."

14 According to Clements (2017), deadnaming occurs when someone, intentionally or not, refers to a person who is transgender by the name they used before they transitioned.

15 Thanks to an anonymous reader of a draft version of this paper for this suggestion.
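The LEA metric mentioned above weights each entity by its size and scores it by the fraction of its coreference links the system recovers. A simplified sketch (ours, covering non-singleton entities only and omitting details of the official scorer):

```python
def lea(key, response):
    """Simplified LEA (Moosavi and Strube, 2016) over non-singleton
    entities. `key` and `response` are lists of sets of mentions.
    Returns (recall, precision, f1)."""
    def links(entity):
        # Number of coreference links in an entity of size n: n*(n-1)/2.
        return len(entity) * (len(entity) - 1) / 2

    def score(gold, system):
        # Each gold entity contributes its size (importance) times the
        # fraction of its links found inside some system entity.
        importance = sum(len(e) for e in gold if len(e) > 1)
        resolution = sum(
            len(e) * sum(links(e & r) for r in system) / links(e)
            for e in gold if len(e) > 1)
        return resolution / importance if importance else 0.0

    recall = score(key, response)
    precision = score(response, key)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1
```

For example, if the key has one entity {a, b, c} and the system outputs {a, b} and {c, d}, only one of the key entity's three links is recovered, giving a recall of 1/3.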
The first observation is that there is still plenty of room for coreference systems to improve: the best-performing system achieves an F1 score of 34%, whereas the Stanford neural system's F1 score on the CoNLL-2012 test set reaches 60% (Moosavi, 2020). Additionally, we can see that system precision dominates recall. This is likely partially due to poor recall of pronouns other than HE and SHE. To analyze this, we compute the recall of each system at finding referential pronouns at all, regardless of whether they are correctly linked to their antecedents. We find that all systems achieve a recall of at least 95% for binary pronouns, around 90% on average for THEY, and a paltry 13% or so for neopronouns (two systems, Stanford deterministic and Stanford neural, never identify any neopronouns at all).
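The pronoun-detection recall analysis above can be sketched as follows (the bookkeeping and the pronoun lists are our own illustrative assumptions, not the paper's evaluation code):

```python
# Recall at merely detecting referential pronouns as mentions, grouped by
# pronoun class, regardless of antecedent linking. The pronoun lists are
# illustrative and far from exhaustive (especially for neopronouns).
BINARY = {"he", "him", "his", "she", "her", "hers"}
SINGULAR_THEY = {"they", "them", "their", "theirs", "themself"}
NEO = {"ze", "hir", "xey", "xem", "xyr", "em", "eir"}

def pronoun_detection_recall(gold_pronouns, system_mention_spans):
    """gold_pronouns: list of (token, span) pairs from the key annotation;
    system_mention_spans: set of spans the system marked as mentions."""
    hits = {"binary": 0, "they": 0, "neo": 0}
    totals = {"binary": 0, "they": 0, "neo": 0}
    for token, span in gold_pronouns:
        t = token.lower()
        cls = ("binary" if t in BINARY else
               "they" if t in SINGULAR_THEY else
               "neo" if t in NEO else None)
        if cls is None:
            continue
        totals[cls] += 1
        hits[cls] += int(span in system_mention_spans)
    return {c: (hits[c] / totals[c] if totals[c] else None) for c in hits}
```

Computing this separately per system makes the recall gap between binary pronouns, singular THEY, and neopronouns directly visible, independent of the linking step measured by LEA.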

Discussion and Moving Forward
Our goal in this paper was to analyze how gender bias exists in coreference resolution annotations and models, with a particular focus on how systems may fail to adequately process text involving binary and non-binary trans referents. We thus created two datasets: MAP and GICOREF. Both datasets reveal significant gaps in system performance but, perhaps more importantly, show that taking crowdworker judgments as "gold standard" can be problematic. It may be the case that to truly build gender-inclusive datasets and systems, we need to hire or consult experiential experts (Patton et al., 2019; Young et al., 2019).
Moreover, although we studied crowdworkers on Mechanical Turk (because they are often employed as annotators for NLP resources), if other populations are used for annotation, it becomes important to consider their positionality and how it may impact annotations. This echoes a related finding in hate-speech annotation that annotator positionality matters (Olteanu et al., 2019). More broadly, we found that trans-exclusionary assumptions around gender are commonly (and implicitly) made in NLP papers, a practice that we hope to see change in the future because it fundamentally limits the applicability of NLP systems.
The primary limitation of our study and analysis is that it is limited to English. This is particularly limiting because English lacks a grammatical gender system, and some extensions of our work to languages with grammatical gender are non-trivial.
We also emphasize that while we endeavored to be inclusive, our own positionality has undoubtedly led to other biases. One in particular is a largely Western bias, both in terms of what models of gender we use and also in terms of the data we annotated. We have attempted to partially compensate for this bias by intentionally including documents with non-Western non-binary expressions of gender in the GICoref dataset 16 , but the dataset nonetheless remains Western-dominant.
Additionally, our ability to collect naturally occurring data was limited because many sources simply do not yet permit (or have only recently permitted) the use of gender inclusive language in their articles. This led us to counterfactual text manipulation, which, while useful, is essentially impossible to do flawlessly. Moreover, our ability to evaluate coreference systems with data that includes incorrect references was limited as well, because current systems do not mark any forms of misgendering or deadnaming explicitly, and current metrics do not take this into account. Finally, because the social construct of gender is fundamentally contested, some of our results may apply only under some frameworks.
We hope this paper can serve as a roadmap for future studies. In particular, the gender taxonomy we presented, while not novel, is (to our knowledge) previously unattested in discussions around gender bias in NLP systems; we hope future work in this area can draw on these ideas. We also hope that developers of datasets or systems can use some of our analysis as inspiration for how one can attempt to measure-and then root out-different forms of bias in coreference resolution systems and NLP systems more broadly.

A Examples of Possible Bias in Data Annotation
Bias can enter coreference resolution datasets, which we use to train our systems, through the annotation phase. Annotators may use linguistic cues to infer social gender. For instance, consider (2) below, in which an annotator is likely to determine that "her" refers to "Mary" and not "John" due to assumptions about likely mappings from names to pronouns (or possibly by not considering that SHE pronouns could refer to someone named "John"). In (3), by contrast, an annotator is likely to have difficulty making a determination because both "Sue" and "Mary" suggest "her". In (4), an annotator lacking knowledge of name stereotypes for typical Chinese and Indian names, respectively (plus the fact that given names in Chinese, especially when romanized, generally do not signal gender strongly), will likewise have difficulty.
(2) John and Mary visited her mother.
(3) Sue and Mary visited her mother.
In all these cases, the plausible rough inference is that a reader takes a name and uses it to infer the social gender of the extra-linguistic referent; upon later seeing the SHE pronoun, the reader infers the referential gender of that pronoun and checks whether the two match. An equivalent inference happens not just for names, but also for lexical gender references (both gendered nouns (5) and terms of address (6)), grammatical gender references (in gender languages like Arabic (7)), and social gender references (8). The last of these, (8), is the case in which the correct referent is likely to be least clear to most annotators, and also the case studied by Rudinger et al. (2018).

(8) The nurse and the actor visited her mother.

B Annotation of ACL Anthology Papers
Below we list the complete set of annotations we made of the papers described in §5.1. For each of the papers considered, we annotate the following items:

• Coref: Does the paper discuss coreference resolution?
• L.G: Does the paper deal with linguistic gender (grammatical gender or gendered pronouns)?
• S.G: Does the paper deal with social gender?
• Eng: Does the paper study English?
• L≠S: (If yes to L.G and S.G:) Does the paper distinguish linguistic from social gender?
• 0/1: (If yes to S.G:) Does the paper explicitly or implicitly assume that social gender is binary?
• Imm: (If yes to S.G:) Does the paper explicitly or implicitly assume social gender is immutable?
• Neo: (If yes to S.G and Eng:) Does the paper explicitly consider uses of definite singular "they" or neopronouns?

For each of these, we mark [Y] if the answer is yes, [N] if the answer is no, and [-] if the question is not applicable (i.e., it does not pass the conditional checks).
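As an illustration only (this record type is ours, not from the paper), the scheme above, with its [Y]/[N]/[-] conditional structure, could be encoded as a record in which None stands for [-]:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PaperAnnotation:
    citation: str
    coref: bool              # discusses coreference resolution
    linguistic_gender: bool  # L.G
    social_gender: bool      # S.G
    english: bool            # Eng
    # Conditional items: None encodes "[-]" (not applicable).
    distinguishes_ling_social: Optional[bool] = None  # only if L.G and S.G
    assumes_binary: Optional[bool] = None             # only if S.G
    assumes_immutable: Optional[bool] = None          # only if S.G
    considers_neopronouns: Optional[bool] = None      # only if S.G and Eng

    def validate(self):
        # Enforce the conditional applicability rules of the scheme.
        if self.distinguishes_ling_social is not None:
            assert self.linguistic_gender and self.social_gender
        if self.assumes_binary is not None or self.assumes_immutable is not None:
            assert self.social_gender
        if self.considers_neopronouns is not None:
            assert self.social_gender and self.english
```

Encoding [-] as None (rather than a third boolean value) makes the conditional structure checkable: a conditional answer can only be present when its preconditions hold.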

Citation Coref L.G S.G Eng L≠S 0/1 Imm Neo
Kholy and Habash (2012) N

Dana Alix Zzyym A is an Intersex activist and former sailor who was the first military veteran in the United States to seek a non -binary gender U.S. passport , in a lawsuit Zzyym A v. Pompeo C .

Early life
Zzyym A has expressed that their A childhood as a military brat made it out of the question for them A to be associated with the queer community as a youth due to the prevalence of homophobia in the armed forces . Their A parents B hid Zzyym A 's status as intersex from them A and Zzyym A discovered their A identity and the surgeries their A parents B had approved for them A by themselves B after their A Navy service . In 1978 , Zzyym A joined the Navy as a machinist 's mate .

Activism
Zzyym A has been an avid supporter of the Intersex Campaign for Equality .

Despite dreading their A first true series of final exams , Crona A 's relieved to have a particularly absorbative memory , lucky to recall all the material they A 'd been required to catch up on . Half a semester of attendance , a whole year of course content .
The only true moment of discomfort came when they A 'd arrived at the essay portion . Thankful it was easy enough to answer , however , their A subtle eye -roll stemmed entirely from just how much writing it asked of them A , hands already beginning to ache at the thought of scrawling out two pages on the origins , history , and importance of partnered and grouped soul resonance .
By the end of it all , their A neck , wrist , back , and ribs ached from the strain of their A typical , hunched posture , a habit they A defaulted to , and Miss Marie B silently wished they A 'd be more mindful of . It was a relief , at least to them A , not to be the last one out of the lecture hall . Booklet turned in , they A left the room as quietly as possible and lingered just outside , an air of hesitance settling upon them A as they A considered what to do now that , it seemed , everything was over with . No more class , no more lessons , just ... students on break from their studies for the season .
" Kind of a breeze , was n't it ? " Evans C ' voice echoes in the arched hall and Crona A 's shoulders jump , their A frame still a tense and anxious mess .
" Oh , " they A sigh , " I A ... I A suppose so . It was n't ... necessarily hard . " Crona A answers , putting forth a vaguely forced smile .
Smiling with the assumed purpose of making Soul C comfortable with the interaction . A defense mechanism . " I A -I A guess , for a final , it was easier than I A expected ... everyone ... made it sound like it 'd be difficult . " " If by everyone , you A mean Black Star D , then yeah , " Soul C chuckles , " he D does n't really do well on ' em ... bad test -taker . " " Ah , " their A facade falls just in time to be replaced by a much more genuine grin . Of the little they A 'd spent talking to Black Star D , he D certainly had confidence and skill enough to make up for the lost exam points given his D performance in every other grading category .
" That ... makes sense . " " Maka E 's always the first one done when it comes to this stuff , she E practically studies in her E sleep . I C 'm convinced she E must be practicing clairvoyance the way she E burns through essay questions , " Soul C laughs , turning to the meek teen A who gives him C a simple nod in response .
Determined not to let an impending awkward silence fall between them F , Soul C pipes up again , " So , are you A staying here for break ? " " Ye -well , I A ... I A think so , " they A begin , stuttering , but encouraged to continue by a cock of Soul C 's head ; a social cue even they A could read , " The professor H ... and Miss Marie B G asked if I A 'd like to come and stay with them G for the time being . " " Oh , huh , Stein H and Marie B G ? Nice , " his C brows lift , clearly some varying degree of happy for the other A . The optimism is short -lived , observing as Crona A 's expression falls back to its characteristic expressionless gaze .
" It seems like you A 've got a good thing going with those two G . " " I A have n't decided , yet , if I A should accept the invitation , " they A shift a bit where they A stand . Never having been the best at reassuring others , even his C own meister A , Soul C kept his C mouth shut to avoid stuttering while he C searched for the right words a web of thoughts .
" Y ' A know , I C think it 's less of an invitation and more of an extended welcome . " The other A raises their A head , taken aback , " Oh , " Crona A mutters , in a poignant tone , " I A ... never considered something like that . " Soul C does n't leave much wiggle room for their A mood to fall any further ( nothing past a flat -lipped frown ) , " They G 'd probably love to have you A , I C bet they G drive each other nuts sometimes all by themselves G . " Though Evans C wo n't admit it , he C knows it 's all too likely Stein H might actually put some more effort into taking care of himself H if he H had someone else besides Marie B to look after .
" I A -I A see , " they A exhale with a nod , giving Soul C a hint of affirmation that he C 'd done something to boost the kid A 's confidence .
" I C mean , it 's got ta be lonely not to mention boring hanging here all summer ... and the weather , " Soul C nearly gasps , dramatizing it for added effect , " Oh , man , I C do n't know how you A can stay cooped up in that room of yours A when it 's so nice out , " he C grins .
" But ... meh . Different strokes . I C ca n't judge . " His C comments comfort them A , an for a moment they A forget how this came to be . The cathedral in Italy , Lady Medusa I 's wrath , and the black blood that infected him C . Every moment they A spent in the presence of Soul Evans C builds always up to this ; fixation on the memories of their J first encounters and all the pain they A 've caused him C , the pain they A 've caused he C and Maka E K both . As quickly as Soul C had lifted the swordsman A 's spirits , they A 'd weighed themselves A down once more . It seemed so normal , though . Soul C could n't bring himself C to feel any sense of accomplishment in the coaxing -out of Crona A 's smile when the return of their A self doubt was as certain as the sun in the sky . His C own stubbornness could n't let his C diminished self worth lie .
With another encouraging smile , rows of sharpened incisors appearing oddly charismatic , he C opens his C mouth to speak -but finds himself C cut off before he C can even squeeze a word in .
" Soul C , I A 'm sorry , " the meister A blurts . Having been pent -up for months , the apology comes forth without inhibition , rolling effortlessly off their A tongue .
" Sorry ... ? For what ? " Evans C quirks a brow , chuckling . He C adjusts his C stance to face Crona A with the whole of his C body , maintaining his C positive demeanor . " F -for what ... ? " They A stammer , shaking their A head . For all their A remorse , they A thought this would have been obvious . " For everything , it 's ... the first time we F dueled , I A was the enemy ! I A -I A almost killed you C , I A -I A ... I A really , really hurt you C , " they A answer , still so sick with guild that even their A confession of responsibility is tainted with frustration .
Soul C seems stunned for a moment before harnessing his C quick wit .