Towards Understanding Gender Bias in Relation Extraction

Recent developments in Neural Relation Extraction (NRE) have made significant strides towards Automated Knowledge Base Construction. While much attention has been dedicated towards improvements in accuracy, there have been no attempts in the literature to evaluate social biases exhibited in NRE systems. In this paper, we create WikiGenderBias, a distantly supervised dataset composed of over 45,000 sentences including a 10% human annotated test set for the purpose of analyzing gender bias in relation extraction systems. We find that when extracting spouse-of and hypernym (i.e., occupation) relations, an NRE system performs differently when the gender of the target entity is different. However, such disparity does not appear when extracting relations such as birthDate or birthPlace. We also analyze how existing bias mitigation techniques, such as name anonymization, word embedding debiasing, and data augmentation, affect the NRE system in terms of maintaining the test performance and reducing biases. Unfortunately, because NRE models rely heavily on surface-level cues, we find that existing bias mitigation approaches have a negative effect on NRE. Our analysis lays the groundwork for future work on quantifying and mitigating bias in NRE.


Introduction
With the wealth of information being posted online daily, Relation Extraction (RE) has become increasingly important. RE aims specifically to extract relations from raw sentences and represent them as succinct relation tuples of the form (head, relation, tail). An example is (Barack Obama, spouse, Michelle Obama).
The concise representations provided by RE models have been used to extend Knowledge Bases (KBs) (Subasic et al., 2019; Trisedya et al., 2019). These KBs are then used heavily in NLP systems, such as Task-Based Dialogue Systems. In recent years, much focus in the NRE community has been centered on improvements in model
* Equal Contribution.
In this paper, we take the first step toward understanding and evaluating gender bias in NRE systems. We analyze gender bias by measuring the differences in model performance when extracting relations from sentences written about females versus sentences written about males. Significant discrepancies in performance between genders could diminish the fairness of systems and distort outcomes in applications that use them. For example, if a model predicts the occupation relation with higher recall for male entities, this could lead to KBs having more occupation information for males. Downstream search tasks using that KB could produce biased predictions, such as ranking articles about female computer scientists below articles about their male peers.
We provide the first evaluation of social bias in NRE models; specifically, we evaluate gender bias in English language predictions of a collection of popularly used and open-source NRE models (Lin et al., 2016; Wu et al., 2017; Liu et al., 2017; Feng et al., 2018). We evaluate OpenNRE on two fronts: (1) examining Equality of Opportunity (Hardt et al., 2016) when OpenNRE is trained on an unmodified dataset and (2) examining the effect that various debiasing options (Bolukbasi et al., 2016; Rudinger et al., 2018; Zhao et al., 2018a; Lu et al., 2018; Kiritchenko and Mohammad, 2018) have on both absolute F1 score and the difference in F1 scores on male and female datapoints.
However, carrying out such an evaluation is difficult with existing NRE datasets, such as the NYT dataset from Riedel et al. (2010), because there is no reliable way to obtain gender information about the entities. Thus, we create a new dataset specifically aimed at evaluating gender bias for NRE, just as prior work has done for other tasks like Coreference Resolution (Zhao et al., 2018b; Rudinger et al., 2018). We call our dataset WikiGenderBias and make it publicly available. Our contributions are as follows:
• WikiGenderBias is the first dataset aimed at training and evaluating NRE systems for gender bias. It contains ground truth labels for the test set and about 45,000 sentences in total.
• We provide the first evaluation of NRE systems for gender bias and find that they exhibit gender bias.
• We demonstrate that using both gender-swapping and debiased embeddings effectively mitigates bias in the model's predictions and that using gender-swapping improves the model's performance when the training data contains contextual biases.

Related Work
The study of gender bias in NLP is still nascent; gender bias has not been studied in many NLP tasks. Typically, prior work first observes the gender bias, then attempts to mitigate it (Sun et al., 2019). In this paper, we undertake that first step of observation for the task of RE. Since a form of measurement is required for observation, prior work has created methods for measuring gender bias (Zhao et al., 2017; Rudinger et al., 2018; Zhao et al., 2018a; Dixon et al., 2018; Lu et al., 2018; Kiritchenko and Mohammad, 2018; Romanov et al., 2019). Gender bias has been measured mainly in training sets and in predictions. Measuring the latter is simple: measure the difference in performance of the model on male and female datapoints, with the definition of the gender of a datapoint being domain-dependent (Lu et al., 2018; Kiritchenko and Mohammad, 2018). Other metrics have been proposed to evaluate fairness of predictors and allocative bias (Dwork et al., 2012; Hardt et al., 2016). We use two data debiasing methods (Counterfactual Data Augmentation (Zhao et al., 2018a) and Name Anonymization (Zhao et al., 2018a)) and a word embedding debiasing method (Hard Debiasing (Bolukbasi et al., 2016)) and analyze their effect on bias in predictions of NRE models.
In RE, using supervised machine learning models has become popular. Training data for these models is typically obtained using Distant Supervision or a variation of it: for a given relation (e1, r, e2) in a KB, assume any sentence that contains both e1 and e2 expresses r (Mintz et al., 2009). Many NRE models focus on mitigating the effects of noise in the training data introduced by Distant Supervision to increase performance (Hoffmann et al., 2011; Surdeanu et al., 2012; Lin et al., 2016; Liu et al., 2017; Feng et al., 2018). Recent work uses KBs to further increase NRE performance (Vashishth et al., 2018; Han et al., 2018). Despite these significant efforts towards improving NRE performance, there are no studies on bias or ethics in NRE to our knowledge. We provide such a study.

WikiGenderBias
To evaluate gender bias in RE models, we need some measure of how gender affects predictions in RE models. To obtain this, we need some way to identify gender in test instances. Current datasets for RE lack gender information for entities. Obtaining gender information for current datasets could be costly or impossible. Thus, we elected to create WikiGenderBias with this gender information. Specifically, we wanted to measure how predictions differed on sentences from Wikipedia articles about male entities versus those about female entities. Since most data about an entity in a KB is generated from that entity's page, if an NRE model performed better for male articles, then male entities would likely have more information in a KB. This bias could propagate to downstream predictions for models using the KB, so evaluating performance differences on articles about entities of different genders is useful. WikiGenderBias's splits are given in Table 1.

Dataset Creation
To generate WikiGenderBias, we use a variant of the Distant Supervision assumption: for a given relation between two entities, assume that any sentence from an article written about one of those entities that mentions the other entity expresses the relation. For instance, if we know (Barack, spouse, Michelle) is a relation and we find the sentence "He and Michelle were married" in Barack's Wikipedia article, then we assume that sentence expresses the (Barack, spouse, Michelle) relation. This assumption is similar to that made by Mintz et al. (2009) and allows us to scalably create the dataset.
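The labeling rule above can be illustrated with a minimal sketch. It is not the pipeline used to build WikiGenderBias: the function name `label_article_sentences`, the whole-word string match on the tail entity, and the second toy sentence are assumptions made purely for illustration.

```python
import re

def label_article_sentences(head, article_sentences, kb_triples):
    """Label sentences from the head entity's article (hypothetical helper).

    Any sentence in the article that mentions the tail entity of a known
    (head, relation, tail) triple is assumed to express that relation.
    """
    labeled = []
    for h, relation, tail in kb_triples:
        if h != head:
            continue
        # Whole-word match on the tail's surface form (a simplification).
        pattern = re.compile(r"\b" + re.escape(tail) + r"\b")
        for sentence in article_sentences:
            if pattern.search(sentence):
                labeled.append((sentence, head, relation, tail))
    return labeled

# Toy article for the head entity "Barack".
article = ["He and Michelle were married.", "He was born in Honolulu."]
triples = [("Barack", "spouse", "Michelle"),
           ("Barack", "birthPlace", "Honolulu")]
examples = label_article_sentences("Barack", article, triples)
```

Because the rule only checks for a mention, a sentence that names the tail entity without expressing the relation would still be labeled, which is the source of noise that the annotated test set is meant to address.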
We use Wikipedia because many entities on Wikipedia have gender information and because Wikipedia contains articles written about these entities. This, combined with relation information about these entities obtainable from DBPedia, Wikipedia's KB, allowed us to create WikiGenderBias using our variant of the Distant Supervision assumption.
In WikiGenderBias, we use four relations: spouse, hypernym, birthDate, and birthPlace. We chose from a given set of relations stored in DBPedia. We hypothesized that models might use gender as a proxy to influence predictions for spouse and hypernym relations, since words pertaining to marriage are more often mentioned in female articles and words pertaining to hypernym (which is similar to occupation) are more often mentioned in articles about males (Wagner et al., 2015; Graells-Garrido et al., 2015). We hypothesized that birthDate and birthPlace would operate like control groups and believed gender would correlate with neither relation. We also generate negative examples for these four relations by obtaining datapoints for three unrelated relations: parents, deathDate, and almaMater.
We use entities for which we could obtain data for all four relations. We set up our experiment such that head entities are not repeated across the train, dev, and test sets so that the model will see only new head entities at testing time. Since we obtain the distantly supervised sentences for a relation from the head entity's article, this guarantees the model will not see sentences from the same article across datasets. However, it is possible that a head entity will appear as a tail entity in other relations, so entities could appear in multiple datasets.
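A split that keeps head entities disjoint across partitions can be sketched as follows. This is an illustrative sketch only: the function name `split_by_head_entity`, the `"head"` field, and the 10% dev and test fractions are assumptions, not the procedure or proportions used for WikiGenderBias.

```python
import random

def split_by_head_entity(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Assign every example of a head entity to exactly one partition."""
    heads = sorted({ex["head"] for ex in examples})
    random.Random(seed).shuffle(heads)
    n_test = int(len(heads) * test_frac)
    n_dev = int(len(heads) * dev_frac)
    test_heads = set(heads[:n_test])
    dev_heads = set(heads[n_test:n_test + n_dev])
    splits = {"train": [], "dev": [], "test": []}
    for ex in examples:
        if ex["head"] in test_heads:
            splits["test"].append(ex)
        elif ex["head"] in dev_heads:
            splits["dev"].append(ex)
        else:
            splits["train"].append(ex)
    return splits
```

Splitting on the head entity rather than on individual sentences is what prevents sentences from one article from leaking across partitions; tail entities are not constrained, matching the caveat above.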
WikiGenderBias's gender splits are given in Table 1. We first train OpenNRE on the raw, gender-imbalanced training data to reflect model performance without modification. We then introduce bias mitigation methods such as gender-swapping, name anonymization, and hard debiasing from prior work to evaluate the tradeoff between model performance and gender parity.

Test Sets
We partition the test set into two subsets: one with sentences from female articles, and one with sentences from male articles (see Table 1). We collect data using our variant of the distant supervision assumption (see Section 3.1). However, as noted earlier, some sentences can be noisy. Evaluating models on noisy data is unfair since a model could be penalized for correctly predicting the relation is not expressed in the sentence. Thus, we had to obtain ground truth labels.
To find the ground truth, we collected annotations from AMT workers. We asked these workers to determine whether or not a given sentence expressed a given relation. If the majority answer was no, then we labeled that sentence as expressing no relation. (We denote no relation as NA in WikiGenderBias.) Each sentence was annotated by three different workers. Each worker was paid 15 cents per annotation. We only accepted workers from England, the US or Australia with a HIT Approval Rate greater than 95% and a Number of HITs greater than 100. We found the inter-annotator agreement as measured by Fleiss' Kappa (Fleiss, 1971) κ to be 0.44, which is consistent across both genders and signals moderate agreement. We note that our κ value is affected by asking workers to make binary classifications, which limits the degree of agreement that is attainable above chance. We also found the pairwise inter-annotator agreement to be 84%.
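For readers unfamiliar with these quantities, the sketch below shows how a majority label and Fleiss' Kappa can be computed for three binary votes per sentence. The function names and the toy votes are hypothetical and are not the annotation data; the sketch assumes every item received the same number of ratings.

```python
from collections import Counter

def majority_label(votes):
    """Most common vote; unambiguous for an odd number of binary votes."""
    return Counter(votes).most_common(1)[0][0]

def fleiss_kappa(ratings, categories=("yes", "no")):
    """Fleiss' Kappa for items that all have the same number of ratings."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items  # mean pairwise agreement
    # Chance agreement from the overall proportion of each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)  # undefined if only one category is used

# Toy votes: does the sentence express the given relation?
toy_votes = [
    ["yes", "yes", "yes"],
    ["yes", "yes", "no"],
    ["no", "no", "no"],
    ["yes", "no", "no"],
]
labels = [majority_label(v) for v in toy_votes]
kappa = fleiss_kappa(toy_votes)
```

In this formulation the mean per-item agreement `p_bar` is the raw pairwise agreement, while κ discounts it by the agreement expected from the label proportions alone, which is why a high raw agreement can coexist with a moderate κ when one label dominates.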

Further Analysis
In our creation of WikiGenderBias, we performed some statistical analysis on the Wikipedia data we obtained. We build on the work of Graells-Garrido et al. (2015), who discover that a higher proportion of Wikipedia Infoboxes on Wikipedia pages of female entities have spouse information than Wikipedia Infoboxes on Wikipedia pages of male entities. However, Figure 1 demonstrates a further discrepancy: amongst articles for females and males which contain spouse information, articles written about females mention females' spouses far more often than articles written about males mention males' spouses. Additionally, we show that amongst the female and male articles we sampled, hypernyms are mentioned far more often in male than female articles.
That female articles mention the females' spouses more often than male articles indicates gender bias in Wikipedia's composition; authors do not write about the two genders equally.

Experimental Setup
We evaluate NRE models from a popular open-source code repository called OpenNRE (Han et al., 2019). OpenNRE models combine methods including usage of selective attention to add weight to sentences with relevant information (Lin et al., 2016) as well as methods to reduce noise at an entity-pair level (Lin et al., 2016) and innovations in adversarial training of NRE models (Wu et al., 2017). OpenNRE allows users to choose a selector (Attention or Average) and an encoder (PCNN, CNN, RNN, or Bi-RNN) for each model. Each of these models requires word embeddings to create distributed representations of sentences. It should be noted that a PCNN is simply a CNN which has a piecewise max-pooling operation, where the sentence is split into three sections based on the positions of the head and tail entities (Zeng et al., 2015).
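The piecewise max-pooling operation can be sketched in plain Python as follows. This is a simplified illustration and not OpenNRE's code: the function name, the choice to include each entity position in the segment that ends at it, and the zero padding for empty segments are assumptions.

```python
def piecewise_max_pool(features, head_pos, tail_pos):
    """Max-pool a (sequence_length x num_filters) feature map in three pieces.

    The sequence is cut at the two entity positions and each piece is
    max-pooled per filter; the three pooled vectors are concatenated.
    """
    first, second = sorted((head_pos, tail_pos))
    pieces = [
        features[: first + 1],
        features[first + 1 : second + 1],
        features[second + 1 :],
    ]
    num_filters = len(features[0])
    pooled = []
    for piece in pieces:
        if piece:
            pooled.extend(max(row[f] for row in piece) for f in range(num_filters))
        else:
            # Empty piece (for example, an entity at the end): pad with zeros.
            pooled.extend([0.0] * num_filters)
    return pooled

# Five token positions, two convolution filters, entities at positions 1 and 3.
feature_map = [[1, 0], [3, 2], [0, 5], [4, 1], [2, 2]]
print(piecewise_max_pool(feature_map, head_pos=1, tail_pos=3))  # [3, 2, 4, 5, 2, 2]
```

Compared with a single max over the whole sentence, the pooled vector is three times as long and retains where, relative to the two entities, each filter fired most strongly.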
As mentioned in Section 1, we evaluate the OpenNRE models using Equality of Opportunity and differences in F1 score between genders.

Parameters for Equality of Opportunity
We train every encoder-selector combination on the WikiGenderBias training set, use Word2Vec embeddings (Mikolov et al., 2013) also trained on WikiGenderBias, and test each combination on the WikiGenderBias test set. We also utilize Equality of Opportunity (EOP) (Hardt et al., 2016). In our case A = {male, female}, because gender is our protected attribute and we assume it to be binary. We evaluate EOP on a per-relation, one-versus-rest basis. Thus, we calculate one EOP where spouse is the positive class and all other classes are negative; in this case, Y = 1 corresponds to the true label being spouse and Y = 0 corresponds to the true label being hypernym, birthDate, birthPlace, or NA. We then do another calculation for each relation where Y = 1 corresponds to that relation being expressed and Y = 0 corresponds to any other relation being expressed. Note that this is equivalent to measuring per-relation recall for each gender.
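Since the per-relation EOP comparison reduces to a recall difference between genders, it can be sketched in a few lines. The record fields (`gender`, `gold`, `pred`), the function names, and the toy predictions are assumptions for illustration and do not reflect the actual evaluation code or results.

```python
def recall_for_relation(examples, relation):
    """One-versus-rest recall: P(pred = relation | gold = relation)."""
    positives = [ex for ex in examples if ex["gold"] == relation]
    if not positives:
        raise ValueError("no gold examples for relation: " + relation)
    return sum(ex["pred"] == relation for ex in positives) / len(positives)

def eop_difference(examples, relation):
    """Male recall minus female recall; 0 means EOP holds for the relation."""
    by_gender = {}
    for gender in ("male", "female"):
        subset = [ex for ex in examples if ex["gender"] == gender]
        by_gender[gender] = recall_for_relation(subset, relation)
    return by_gender["male"] - by_gender["female"]

# Toy predictions: 3 of 4 male and 2 of 4 female spouse sentences recovered.
toy = (
    [{"gender": "male", "gold": "spouse", "pred": "spouse"}] * 3
    + [{"gender": "male", "gold": "spouse", "pred": "NA"}]
    + [{"gender": "female", "gold": "spouse", "pred": "spouse"}] * 2
    + [{"gender": "female", "gold": "spouse", "pred": "NA"}] * 2
)
print(eop_difference(toy, "spouse"))  # 0.25
```

A positive value here corresponds to higher recall on male datapoints, the same sign convention used when reporting male minus female differences.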

Bias Mitigation Methods
We then evaluate the PCNN with Attention model using the debiasing methods described below.
Table 2: Equality of Opportunity results from running combinations of encoders and selectors of the OpenNRE model for the male and female genders of each relation. A positive difference means a higher prediction recall for male entities. A predictor would satisfy Equality of Opportunity if and only if the difference were 0, and a fair predictor should have close to 0 difference.

The contexts in which males and females are written about can differ; for instance, on Wikipedia women are more often written about with words related to sexuality than men (Graells-Garrido et al., 2015). Counterfactual Data Augmentation (CDA) mitigates these contextual biases. CDA consists of replacing masculine words in a sentence with their corresponding feminine words and vice versa for all sentences in a corpus, then training on the union of the original and augmented corpora. This equalizes the contexts for feminine and masculine words; if previously 100 doctors were referred to as he and 50 as she, in the new training set he and she will refer to doctor 150 times each.

Sometimes, models use entity names as a proxy for gender; if a model associates females with politician and John with males, then it might be less likely to predict that "John is a politician" expresses (John, hypernym, politician) than it would if it associated John with females. Name Anonymization (NA) mitigates this. NA consists of finding all person entities with a Named Entity Recognition system (Finkel et al., 2005) then replacing the names of these entities with corresponding anonymizations. For instance, the earlier example might become "E1 is a politician", thereby preventing the model from using names as a proxy for gender.
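Both data-level methods can be illustrated with a small sketch. It is deliberately simplified and is not the preprocessing used in our experiments: the word-pair lexicon is a tiny hypothetical sample, ambiguous words such as "her" are left out, and the person names are passed in by hand rather than detected by a Named Entity Recognition system.

```python
import re

# Tiny illustrative lexicon (assumption); a real list would be much larger.
GENDER_PAIRS = {
    "he": "she", "she": "he",
    "husband": "wife", "wife": "husband",
    "father": "mother", "mother": "father",
}

def gender_swap(sentence):
    """Replace each gendered word with its counterpart, keeping capitalization."""
    def swap(match):
        word = match.group(0)
        target = GENDER_PAIRS.get(word.lower())
        if target is None:
            return word
        return target.capitalize() if word[0].isupper() else target
    return re.sub(r"[A-Za-z]+", swap, sentence)

def augment(corpus):
    """CDA: the union of the original corpus and its gender-swapped copy."""
    return corpus + [gender_swap(s) for s in corpus]

def anonymize(sentence, person_names):
    """NA: replace each given person name with a placeholder E1, E2, ..."""
    for i, name in enumerate(person_names, start=1):
        sentence = re.sub(r"\b" + re.escape(name) + r"\b", "E%d" % i, sentence)
    return sentence

corpus = ["he is a doctor"] * 100 + ["she is a doctor"] * 50
augmented = augment(corpus)
```

On this toy corpus the augmented set contains "he is a doctor" and "she is a doctor" 150 times each, mirroring the doctor example above, and `anonymize("John is a politician", ["John"])` yields "E1 is a politician".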
Word embeddings can encode gender biases (Bolukbasi et al., 2016; Caliskan et al., 2017; Garg et al., 2018) and this can affect bias in downstream predictions for models using the embeddings (Zhao et al., 2018a). Hard-Debiasing mitigates gender bias in embeddings. Hard-Debiasing involves finding a direction representing gender in the vector space, then removing the component along that direction for all gender-neutral words, then equalizing the distance from that direction for all (masculine, feminine) word pairs (Bolukbasi et al., 2016). We applied hard-debiasing to Word2Vec embeddings (Mikolov et al., 2013) we trained on the sentences in WikiGenderBias. Every time we applied CDA or NA or some combination of the two, we trained a new embedding model on that debiased dataset as well.
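The two geometric steps of Hard-Debiasing can be sketched with plain Python lists. This is a simplified illustration under stated assumptions: the gender direction is taken from a single (he, she) difference rather than estimated from many pairs, the vectors are made-up three-dimensional unit vectors, and the helper names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def scale(u, c):
    return [c * a for a in u]

def unit(u):
    return scale(u, 1.0 / math.sqrt(dot(u, u)))

def neutralize(vec, g):
    """Remove the component of vec along the unit gender direction g."""
    return add(vec, scale(g, -dot(vec, g)))

def equalize(vec_m, vec_f, g):
    """Place a (masculine, feminine) pair symmetrically about g.

    Assumes unit-length inputs. Both outputs share the same gender-neutral
    part and differ only in the sign of their component along g.
    """
    midpoint = scale(add(vec_m, vec_f), 0.5)
    neutral = neutralize(midpoint, g)
    radius = math.sqrt(max(0.0, 1.0 - dot(neutral, neutral)))
    sign = 1.0 if dot(vec_m, g) >= dot(vec_f, g) else -1.0
    return (add(neutral, scale(g, sign * radius)),
            add(neutral, scale(g, -sign * radius)))

he, she = [0.8, 0.6, 0.0], [-0.6, 0.8, 0.0]
g = unit(add(he, scale(she, -1.0)))      # gender direction from one pair
doctor = neutralize([0.3, 0.5, 0.4], g)  # a gender-neutral word
he_eq, she_eq = equalize(he, she, g)
```

After these steps the neutralized word has no component along `g`, and the equalized pair has identical inner products with it, so the gender-neutral word is equally similar to both members of the pair.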
As mentioned in Section 2, gender bias can be measured as the difference in a performance metric for a model when evaluated on male and female datapoints. We evaluate the effect these methods have on NRE models using this. We define male (female) datapoints to be relations for which the head entity is male (female), which means the distantly supervised sentence is taken from a male (female) article. Prior work has used area under the precision-recall curve and F1 score to measure NRE model performance (Gupta et al., 2019; Han et al., 2019; Kuang et al., 2019); following prior work, we use F1 score as our performance metric.
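The F1 score gap can be sketched in the same style. The per-relation, one-versus-rest computation, the record fields, and the function names are assumptions for illustration; the aggregation used to produce the reported scores may differ.

```python
def f1_for_relation(examples, relation):
    """One-versus-rest F1 for a relation, computed as 2TP / (2TP + FP + FN)."""
    tp = sum(ex["gold"] == relation and ex["pred"] == relation for ex in examples)
    fp = sum(ex["gold"] != relation and ex["pred"] == relation for ex in examples)
    fn = sum(ex["gold"] == relation and ex["pred"] != relation for ex in examples)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def f1_gender_gap(examples, relation):
    """F1 on male datapoints minus F1 on female datapoints."""
    male = [ex for ex in examples if ex["gender"] == "male"]
    female = [ex for ex in examples if ex["gender"] == "female"]
    return f1_for_relation(male, relation) - f1_for_relation(female, relation)
```

Unlike the recall-based Equality of Opportunity difference, this gap also reacts to false positives, so the two measures can rank models differently.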

Evaluation of Equality of Opportunity
OpenNRE models do not satisfy Equality of Opportunity, although they get close (see Table 2). Predictions on birthDate satisfy Equality of Opportunity the least in all cases except RNN with Attention, where predictions on spouse were the most biased. Notably, Bi-RNN with Average almost perfectly satisfies Equality of Opportunity for spouse.
We also find that Average selectors do slightly better than Attention selectors for preventing bias. For every encoder except PCNN, architectures using the Average selector exhibited significantly less gender bias than models using the same encoder and the Attention selector. In the case of Bi-RNN, the F1 gap for predictions on spouse with the Average selector was less than half the gap for the Attention selector. Average selectors do not provide as dramatic an improvement for Equality of Opportunity and actually increase gender bias for hypernym. All architectures have similar levels of bias, but by the F1 metric CNN with the Average selector seems to have mitigated bias in the spouse relation the best, while Bi-RNN with the Average selector does best by the Equality of Opportunity metric.
It is also worth noting that the average selector performed slightly better than the attention selector across the board, which is intriguing considering that the average selector is used as a baseline since it weights sentences in the training data equally for each relation.

Evaluating Bias Mitigation Methods
F1 scores between predictions on male and female sentences on all relations differ for every encoder-selector combination, although the difference is relatively small (see the leftmost column in 2). We find that predictions on spouse typically exhibit the highest difference in F1 score, as we predicted. However, surprisingly, predictions on hypernym exhibit the least gender bias and predictions on birthPlace exhibit more significant gender bias than predictions on spouse in some cases. Predictions on birthDate exhibited very little gender bias, as predicted.
Surprisingly, Name Anonymization substantially increases the F1 score gap for the hypernym relation, but slightly decreases the F1 score gap for all other relations. Name Anonymization appears to be effective at debiasing all relations aside from hypernym, though not as effective as either Gender-Swapping or using Debiased Embeddings. These results indicate that entity bias likely does not contribute very much to the gender bias in the models' original predictions.
Hard-Debiased Word Embeddings were also extremely effective at mitigating the difference in F1 scores for all relations. While gender-swapping did slightly better at decreasing that difference for the spouse relation, debiased embeddings mitigated bias better for the birthDate and hypernym relations. We note that using debiased embeddings increases absolute scores just like gender-swapping, though it increases them slightly less.
Gender-Swapping substantially decreases the F1 score gap for the spouse relation as well as for all other relations (see Figure 2). Interestingly, the absolute F1 scores for both male and female sentences for all relations increased when Gender-Swapping was applied (see Figure 3). Thus, gender-swapping is extremely effective not only for mitigating bias but also for improving performance. This is likely due to two things: 1) Wikipedia data is rife with contextual gender bias and 2) gender-swapping successfully removes those biases. Prior work has shown that many corpora contain similar biases, including even news articles like those from Google News (Bolukbasi et al., 2016). Our results show that gender-swapping may be an effective tool to combat this context bias in the domain of NRE.
Combinations. Combining debiased embeddings and gender-swapping turned out to have the highest relative difference in F1 score between male and female sentences for spouse while also reducing bias in other relations (see Figure 3). All models which use name anonymization (Models 1-4) have significantly higher F1 score gaps for the hypernym relation. While all combinations reduced gender bias to varying extents, gender bias in the spouse relation was mitigated to a similar extent by all combinations. Surprisingly, applying gender-swapping on its own reduces gender bias about as well as or better than any combination of methods.
Aggregate Results. Throughout all combinations of debiasing options, the PCNN with Attention model attains a better F1 score for the spouse relation when predicting on male sentences than on female sentences. For birthDate, the F1 score gap is far lower, as we predicted. To our surprise, the F1 score gap was lowest for hypernym, which we predicted would have a higher gap like that for spouse. Also surprisingly, the F1 gap for birthPlace was almost as high as that for spouse. While the model exhibited bias in predictions for all relations, we note that gender-swapping and debiased embeddings were able to significantly mitigate the gap in F1 scores for the model's predictions on male and female sentences. However, while the F1 score gap for birthPlace responded strongly to debiasing methods, spouse did not respond as strongly. Gender-swapping was able to bolster the model's absolute F1 scores as well. Thus, we note that mitigating context bias worked extremely well in this case. Name anonymization was not as effective and actually increased gender bias for hypernym; it seems removing entity bias increased the F1 score gap for hypernym. We note that the best combination for both bias mitigation and absolute model performance was using gender-swapping on its own.

Conclusion
In our study, we create WikiGenderBias: the largest dataset for gender bias evaluation to date across all NLP tasks to our knowledge. We train OpenNRE models on the WikiGenderBias dataset and test them on gender-separated test sets. We find a substantial difference in F1 scores for the spouse relation between predictions on male sentences and female sentences for all OpenNRE model architectures. We find that this gender bias can be substantially mitigated merely by doing pre-processing on the dataset and the word embeddings utilized by the models, and find that the best debiasing combination was gender-swapping paired with debiased embeddings. We also note that this combination significantly increases the model performance in general as well. Finally, we build on Graells-Garrido et al. (2015)'s work and find further context bias latent in Wikipedia.
While these findings will help future work avoid gender biases, this study is preliminary. We only consider binary gender, but future work should consider non-binary genders. Additionally, future work should further probe the source of gender bias in the model's predictions, perhaps by visualizing attention or looking more closely at the model's outputs.

Figure 1: Proportion of sentences corresponding to a given relation over total sentences extracted to WikiGenderBias for each entity. This demonstrates that, of the entities in WikiGenderBias, there are many more sentences expressing the spouse relation for females than males.

Figure 2: Trade-off between relation extraction model performance as measured by aggregate F1 score over both male and female genders (left) and male − female F1 score gender gap (right). This is evaluated on the model with No Debiasing (ND) and three bias mitigation methods: Name Anonymization (NA), Debiased Embeddings (DE) and Gender Swapping (GS). An ideal algorithm maximizes aggregate F1 score while minimizing the gender gap.

Figure 3: OpenNRE results on WikiGenderBias with different input combinations to the model. The y-axis indicates the difference in F1 score between genders (male − female).