Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods

In this paper, we introduce a new benchmark for co-reference resolution focused on gender bias, WinoBias. Our corpus contains Winograd-schema style sentences with entities corresponding to people referred by their occupation (e.g. the nurse, the doctor, the carpenter). We demonstrate that a rule-based, a feature-rich, and a neural coreference system all link gendered pronouns to pro-stereotypical entities with higher accuracy than anti-stereotypical entities, by an average difference of 21.1 in F1 score. Finally, we demonstrate a data-augmentation approach that, in combination with existing word-embedding debiasing techniques, removes the bias demonstrated by these systems in WinoBias without significantly affecting their performance on existing datasets.


Introduction
Coreference resolution is a task aimed at identifying phrases (mentions) referring to the same entity.Various approaches, including rule-based (Raghunathan et al., 2010), feature-based (Durrett and Klein, 2013;Peng et al., 2015a), and neuralnetwork based (Clark and Manning, 2016;Lee et al., 2017) have been proposed.While significant advances have been made, systems carry the risk of relying on societal stereotypes present in training data that could significantly impact their performance for some demographic groups.
In this work, we test the hypothesis that coreference systems exhibit gender bias by creating a new challenge corpus, WinoBias.This dataset follows the winograd format (Hirst, 1981;Rahman and Ng, 2012;Peng et al., 2015b), and contains references to people using a vocabulary of 40 occupations.It contains two types of challenge sentences that require linking gendered pro- The physician called the secretary and told her the cancel the appointment.
The secretary called the physician and told him about a new patient.
The secretary called the physician and told her about a new patient.
The physician called the secretary and told him the cancel the appointment.

Type 2
The physician hired the secretary because she was highly recommended.
The physician hired the secretary because he was highly recommended.
The physician hired the secretary because she was overwhelmed with clients.

Type 1
The physician hired the secretary because he was overwhelmed with clients.Male and female entities are marked in solid blue and dashed orange, respectively.For each example, the gender of the pronominal reference is irrelevant for the co-reference decision.Systems must be able to make correct linking predictions in pro-stereotypical scenarios (solid purple lines) and anti-stereotypical scenarios (dashed purple lines) equally well to pass the test.Importantly, stereotypical occupations are considered based on US Department of Labor statistics.nouns to either male or female stereotypical occupations (see the illustrative examples in Figure 1).None of the examples can be disambiguated by the gender of the pronoun but this cue can potentially distract the model.We consider a system to be gender biased if it links pronouns to occupations dominated by the gender of the pronoun (pro-stereotyped condition) more accurately than occupations not dominated by the gender of the pronoun (anti-stereotyped condition).The corpus can be used to certify a system has gender bias. 1e use three different systems as prototypical examples: the Stanford Deterministic Coreference System (Raghunathan et al., 2010), the Berkeley Coreference Resolution System (Durrett and Klein, 2013) and the current best published system: the UW End-to-end Neural Coreference Resolution System (Lee et al., 2017).Despite qualitatively different approaches, all systems exhibit gender bias, showing an average difference in performance between pro-stereotypical and antistereotyped conditions of 21.1 in F1 score.Finally we show that given sufficiently strong alternative cues, systems can ignore their bias.
In order to study the source of this bias, we analyze the training corpus used by these systems, Ontonotes 5.0 (Weischedel et al., 2012). 2ur analysis shows that female entities are significantly underrepresented in this corpus.To reduce the impact of such dataset bias, we propose to generate an auxiliary dataset where all male entities are replaced by female entities, and vice versa, using a rule-based approach.Methods can then be trained on the union of the original and auxiliary dataset.In combination with methods that remove bias from fixed resources such as word embeddings (Bolukbasi et al., 2016), our data augmentation approach completely eliminates bias when evaluating on WinoBias , without significantly affecting overall coreference accuracy.

WinoBias
To better identify gender bias in coreference resolution systems, we build a new dataset centered on people entities referred by their occupations from a vocabulary of 40 occupations gathered from the US Department of Labor, shown in Table 1. 3 We use the associated occupation statistics to determine what constitutes gender stereotypical roles (e.g.90% of nurses are women in this survey).Entities referred by different occupations are paired and used to construct test case scenarios.Sentences are duplicated using male and female pronouns, and contain equal numbers of correct coreference decisions for all occupations.In total, the dataset contains 3,160 sentences, split equally for development and test, created by researchers familiar with the project.Sentences were created to follow two prototypical templates but annotators were encouraged to come up with scenarios where entities could be interacting in plausible ways.Templates were selected to be challenging ganized by the percent of people in the occupation who are reported as female.When woman dominate profession, we call linking the noun phrase referring to the job with female and male pronoun as 'pro-stereotypical', and 'anti-stereotypical', respectively.Similarly, if the occupation is male dominated, linking the noun phrase with the male and female pronoun is called, 'pro-stereotypical' and 'anti-steretypical', respectively.
and designed to cover cases requiring semantics and syntax separately.4Evaluation To evaluate models, we split the data in two sections: one where correct coreference decisions require linking a gendered pronoun to an occupation stereotypically associated with the gender of the pronoun and one that requires linking to the anti-stereotypical occupation.We say that a model passes the WinoBias test if for both Type 1 and Type 2 examples, prostereotyped and anti-stereotyped co-reference decisions are made with the same accuracy.

Gender Bias in Co-reference
In this section, we highlight two sources of gender bias in co-reference systems that can cause them to fail WinoBias: training data and auxiliary resources and propose strategies to mitigate them.

Training Data Bias
Bias in OntoNotes 5.0 Resources supporting the training of co-reference systems have severe gender imbalance.In general, entities that have a mention headed by gendered pronouns (e.g."he", "she") are over 80% male.5 Furthermore, the way in which such entities are referred to, varies significantly.Male gendered mentions are more than twice as likely to contain a job title as female mentions. 6Moreover, these trends hold across genres.
Gender Swapping To remove such bias, we construct an additional training corpus where all male entities are swapped for female entities and vice-versa.Methods can then be trained on both original and swapped corpora.This approach maintains non-gender-revealing correlations while eliminating correlations between gender and coreference cues.
We adopt a simple rule based approach for gender swapping.First, we anonymize named entities using an automatic named entity finder (Lample et al., 2016).Named entities are replaced consistently within document (i.e."Barak Obama ... Obama was re-elected." would be annoymized to "E1 E2 ... E2 was re-elected." ).Then we build a dictionary of gendered terms and their realization as the opposite gender by asking workers on Amazon Mechnical Turk to annotate all unique spans in the OntoNotes development set. 7ules were then mined by computing the word difference between initial and edited spans.Common rules included "she → he", "Mr." → "Mrs.","mother" → "father."Sometimes the same initial word was edited to multiple different phrases: these were resolved by taking the most frequent phrase, with the exception of "her → him" and "her → his" which were resolved using part-ofspeech.Rules were applied to all matching tokens in the OntoNotes.We maintain anonymization so that cases like "John went to his house" can be accurately swapped to "E1 went to her house."

Resource Bias
Word Embeddings Word embeddings are widely used in NLP applications however recent work has shown that they are severely biased: "man" tends to be closer to "programmer" than "woman" (Bolukbasi et al., 2016;Caliskan et al., 2017).Current state-of-art co-reference systems build on word embeddings and risk inheriting their bias.To reduce bias from this resource, we replace GloVe embeddings with debiased vectors (Bolukbasi et al., 2016).
Gender Lists While current neural approaches rely heavily on pre-trained word embeddings, previous feature rich and rule-based approaches rely on corpus based gender statistics mined from external resources (Bergsma and Lin, 2006).Such lists were generated from large unlabeled corpora using heuristic data mining methods.These resources provide counts for how often a noun phrase is observed in a male, female, neutral, and plural context.To reduce this bias, we balance male and female counts for all noun phrases.

Results
In this section we evaluate of three representative systems: rule based, Rule, (Raghunathan et al., 2010), feature-rich, Feature, (Durrett and Klein, 2013), and end-to-end neural (the current state-ofthe-art), E2E, (Lee et al., 2017).The following sections show that performance on WinoBias reveals gender bias in all systems, that our methods remove such bias, and that systems are less biased on OntoNotes data.
WinoBias Reveals Gender Bias Table 2 summarizes development set evaluations using all three systems.Systems were evaluated on both types of sentences in WinoBias (T1 and T2), separately in pro-stereotyped and anti-stereotyped conditions ( T1-p vs. T1-a, T2-p vs T2-a).We evaluate the effect of named-entity anonymization (Anon.),debiasing supporting resources 8 (Re- sour.) and using data-augmentation through gender swapping (Aug.).E2E and Feature were retrained in each condition using default hyperparameters while Rule was not debiased because it is untrainable.We evaluate using the coreference scorer v8.01 (Pradhan et al., 2014) and compute the average (Avg) and absolute difference (Diff) between pro-stereotyped and antistereotyped conditions in WinoBias.
All initial systems demonstrate severe disparity between pro-stereotyped and anti-stereotyped conditions.Overall, the rule based system is most biased, followed by the neural approach and feature rich approach.Across all conditions, anonymization impacts E2E the most, while all other debiasing methods result in insignificant loss in performance on the OntoNotes dataset.Removing biased resources and data-augmentation reduce bias independently and more so in combination, allowing both E2E and Feature to pass WinoBias without significantly impacting performance on either OntoNotes or WinoBias .Qualitatively, the neural system is easiest to de-bias and our approaches could be applied to future end-to-end systems.Systems were evaluated once on test sets, Table 3, supporting our conclusions.

Systems Demonstrate Less Bias on OntoNotes
While we have demonstrated co-reference systems have severe bias as measured in WinoBias , this is an out-of-domain test for systems trained on OntoNotes.Evaluating directly within OntoNotes is challenging because sub-sampling documents with more female entities would leave very few evaluation data points.Instead, we apply our gender swapping system (Section 3), to the OntoNotes development set and compare system performance between swapped and unswapped data.9If a system shows significant difference between original and gender-reversed conditions, then we would consider it gender biased on OntoNotes data.
Table 4 summarizes our results.The E2E system does not demonstrate significant degradation in performance, while Feature loses roughly 1.0-F1.10This demonstrates that given sufficient alternative signal, systems often do ignore gender biased cues.On the other hand, WinoBias provides an analysis of system bias in an adversarial setup, showing, when examples are challenging, systems are likely to make gender biased predictions.
Machine learning methods are designed to generalize from observation but if algorithms inadvertently learn to make predictions based on stereotyped associations they risk amplifying existing social problems.Several problematic instances have been demonstrated, for example, word embeddings can encode sexist stereotypes (Bolukbasi et al., 2016;Caliskan et al., 2017).Similar observations have been made in vision and language models (Zhao et al., 2017), online news (Ross and Carter, 2011), web search (Kay et al., 2015) and advertisements (Sweeney, 2013).In our work, we add a unique focus on co-reference, and propose simple general purpose methods for reducing bias.
Implicit human bias can come from imbalanced datasets.When making decisions on such datasets, it is usual that under-represented samples in the data are neglected since they do not influence the overall accuracy as much.For binary classification Kamishima et al. (2012Kamishima et al. ( , 2011) ) add a regularization term to their objective that penalizes biased predictions.Various other approaches have been proposed to produce "fair" classifiers (Calders et al., 2009;Feldman et al., 2015;Misra et al., 2016).For structured prediction, the work of Zhao et al. (2017) reduces bias by using corpus level constraints, but is only practical for models with specialized structure.Kusner et al. (2017) propose the method based on causal inference to achieve the model fairness where they do the data augmentation under specific cases, however, to the best of our knowledge, we are the first to propose data augmentation based on gender swapping in order to reduce gender bias.
Concurrent work (Rudinger et al., 2018) also studied gender bias in coreference resolution systems, and created a similar job title based, winograd-style, co-reference dataset to demonstrate bias 11 .Their work corroborates our findings of bias and expands the set of systems shown to be biased while we add a focus on debiasing methods.Future work can evaluate on both datasets.

Conclusion
Bias in NLP systems has the potential to not only mimic but also amplify stereotypes in society.For a prototypical problem, coreference, we provide a method for detecting such bias and show that 11 Their dataset also includes gender neutral pronouns and examples containing one job title instead of two.three systems are significantly gender biased.We also provide evidence that systems, given sufficient cues, can ignore their bias.Finally, we present general purpose methods for making coreference models more robust to spurious, genderbiased cues while not incurring significant penalties on their performance on benchmark datasets.

Figure 1 :
Figure 1: Pairs of gender balanced co-reference tests in the WinoBias dataset.Male and female entities are marked in solid blue and dashed orange, respectively.For each example, the gender of the pronominal reference is irrelevant for the co-reference decision.Systems must be able to make correct linking predictions in pro-stereotypical scenarios (solid purple lines) and anti-stereotypical scenarios (dashed purple lines) equally well to pass the test.Importantly, stereotypical occupations are considered based on US Department of Labor statistics.
Type 1: [entity1] [interacts with] [entity2] [conjunction][pronoun][circumstances].Prototypical WinoCoRef style sentences, where co-reference decisions must be made using world knowledge about given circumstances (Figure1; Type 1).Such examples are challenging because they contain no syntactic cues.Type 2: [entity1] [interacts with] [entity2] and then [interacts with] [pronoun] for [circumstances].These tests can be resolved using syntactic information and understanding of the pronoun (Figure1; Type 2).We expect systems to do well on such cases because both semantic and syntactic cues help disambiguation.

Table 1 :
Occupations statistics used in WinoBias dataset, or-

Table 2 :
(Graham et al., 2014)inoBias development set.WinoBias results are split between Type-1 and Type-2 and in pro/anti-stereotypical conditions.*indicates the difference between pro/anti stereotypical conditions is significant (p < .05)underanapproximate randomized test(Graham et al., 2014).Our methods eliminate the difference between pro-stereotypical and anti-stereotypical conditions (Diff), with little loss in performance (OntoNotes and Avg).

Table 3 :
F1 on OntoNotes and Winobias test sets.Methods were run once, supporting development set conclusions.

Table 4 :
Performance on the original and the gender- reversed developments dataset (anonymized).