Is Wikipedia succeeding in reducing gender bias? Assessing changes in gender bias in Wikipedia using word embeddings

Large text corpora used for creating word embeddings (vectors which represent word meanings) often contain stereotypical gender biases. As a result, such unwanted biases will typically also be present in word embeddings derived from such corpora and in downstream applications in the field of natural language processing (NLP). To minimize the effect of gender bias in these settings, more insight is needed into where and how biases manifest themselves in the text corpora employed. This paper contributes by showing how gender bias in word embeddings from Wikipedia has developed over time. Quantifying the gender bias over time shows that art-related words have become more female biased. Family and science words have stereotypical biases towards female and male words, respectively. These biases seem to have decreased since 2006, but these changes are not more extreme than those seen in random sets of words. Career-related words are more strongly associated with male than with female words; this difference has only become smaller in recently written articles. These developments provide additional understanding of what can be done to make Wikipedia more gender neutral and of how important time of writing can be when considering biases in word embeddings trained from Wikipedia or from other text corpora.


Introduction
Word embeddings are vectors that represent the meaning of words and their relation. They are the cornerstone of many NLP techniques. For example, word embeddings can be used to search in documents, to analyze sentiment and to classify documents [Mikolov et al., 2013a, Nalisnick et al., 2016, Parikh et al., 2018, Jang et al., 2019]. These embeddings are typically created using unsupervised learning from a large corpus of text [Krishna and Sharada, 2019].
Large corpora of text used for training word embeddings may contain stereotypical biases. Word embeddings can then inherit these biases [Mikolov et al., 2013a, Caliskan et al., 2017, Jones et al., 2020]. For example, stereotypical words such as 'marriage' can be more strongly associated with female words than male words. In fact, changes in word embeddings can be useful for detecting minor changes in the meaning of words at small time scales [Kutuzov et al., 2018].
Biases in word embeddings may, in turn, have unwanted consequences in applications. Bolukbasi et al. [2016] show that when embeddings are used to improve search results, biased embeddings can lead to biased results. As an example, scientific research with male names may be ranked higher if male names have a stronger association with the scientific search words [Bolukbasi et al., 2016].
Another example of a downstream application with unwanted gender bias consequences is machine translation. When translating a sentence from a language with a gender neutral pronoun to English, a sentence about a nurse may be translated with a female pronoun while a sentence with the word engineer may be translated with a male pronoun [Prates et al., 2019]. Such stereotypical translations can be avoided by using a more gender neutral embedding [Font and Costa-Jussa, 2019]. Bolukbasi et al. [2016] have already proposed a method for debiasing word embeddings. However, it has been hypothesized that debiasing covers up biases instead of removing them [Gonen and Goldberg, 2019]. Stereotypical words remain clustered in the debiased embeddings and thus there is still a risk for algorithmic discrimination [Gonen and Goldberg, 2019]. A more robust debiasing procedure is yet to be proposed.
Gender bias, as measured in word embeddings trained on books, has been shown to decrease over time up to the year 2000 [Jones et al., 2020, Garg et al., 2018]. Whether the decreasing trend has continued in more recent years has not been tested. If bias has continued to decrease, a straightforward way to obtain less biased word embeddings would be to train word embeddings on more recent corpora of text. To investigate this issue, we will measure gender bias in one of the largest openly available text corpora: Wikipedia. Wagner et al. [2015] already showed the presence of gender bias in Wikipedia. The editors of Wikipedia have actively tried to reduce this bias since 2013 [Wikipedia contributors, 2020a]. Our research can be used to evaluate the effectiveness of these efforts, and may inspire new strategies to reduce bias further. Towards that end, we will answer the question: 'How does gender bias in word embeddings from Wikipedia develop over the years 2006-2020?'.
Contributions: 1. We extend the work of Jones et al. [2020] and Garg et al. [2018] by looking at more recent years and applying their methods to the corpus of Wikipedia.
2. Our work provides insight into how gender bias has developed in Wikipedia using four categories. So far, most research into this bias has been static. Our research shows to what extent the efforts of Wikipedia editors were successful, while also suggesting possible improvements to their current strategy.
3. We illustrate that year of retrieval is important for gender bias in the word embeddings from Wikipedia. If gender neutrality w.r.t. a domain is important, our results suggest what year to use.

Gender Bias in Wikipedia
In 2011, a big survey on the demographics of Wikipedia editors showed that less than 15% of Wikipedia editors are female [Collier and Bear, 2012]. This led to further investigations into the impact on content of Wikipedia considering different dimensions of gender bias. Two important dimensions of gender bias as researched by Wagner et al. [2015] are coverage bias and lexical bias.
Coverage bias means that notable women are not covered as well as notable men. For example, a smaller percentage of notable women have their own Wikipedia page or these pages may be less extensive. Wagner et al. [2015] looked at three data sets of notable people and found no coverage bias. However, later research by Wagner et al. [2016] did show a small glass ceiling effect. Google search trends were used to assess the notability of people covered on Wikipedia. Women on Wikipedia were found to be more notable than men on average, which suggests that women have to be more notable to be covered on Wikipedia. The efforts of Wikipedia editors have mostly focused on this coverage bias, specifically by making lists of missing notable women and creating articles for these women [Wikipedia contributors, 2020b]. In terms of gender associations in word embeddings, this may have caused words that are commonly used in these biographies to have become more female associated.
Lexical bias relates to the words used on pages written about women and men. Wagner et al. [2016] found two significant differences. Words related to family and relationships are more present in female articles compared to male articles. An article about a divorced person is 4.4 times more likely to be about a woman. The second difference is a stronger emphasis on gender. Articles about women contain more words that are gender-specific, such as 'female' or 'woman'. This can cause biases in the word embeddings. When biographies about women, for example, contain phrases such as 'female scientist', whereas men are referred to as 'scientist', the word scientist would be more closely associated with female, despite there being both male and female scientists.
Besides this, there has also been research into the development of the gender proportion in the Wikipedia biographies. This has been recorded since 2014, and since 2017 it has also been measured by occupation (see Figure 1) [Konieczny and Klein, 2018].
The biggest change can be seen for the occupation 'manager', for which the percentage of female biographies increased by more than 5% in the last 3 years. However, this is still below average. The occupation artist has a female percentage far above average, at almost 30%. Furthermore, the overall fraction of female biographies has increased steadily towards around 18% [Envel Le Hir, 2017-2020]. Thus matters are improving, but women are generally still less represented in Wikipedia.

Word Embedding Association Test
As proposed by Caliskan et al. [2017], we use the Word Embedding Association Test (WEAT) to quantify gender bias. This test uses four categories that are considered stereotypical towards gender: Arts, Science, Family and Career [Caliskan et al., 2017]. These categories have shown significant bias towards male or female words in embeddings from Google News corpora [Mikolov et al., 2013a], Google Books [Jones et al., 2020], as well as a 'Common Crawl' corpus [Caliskan et al., 2017]. Each category C has a set of eight words and there are two sets (M and F) of target words relating to male and female respectively (Table 7 in the Appendix). These words are based on an implicit association test also used in psychology [Caliskan et al., 2017].
The WEAT score is computed as follows. The association between a pair of words with vectors $v_1$ and $v_2$ is measured by the cosine similarity:

$$\cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert} \tag{1}$$

Let $v_c$ denote a word from category $C$, $v_m$ a male-specific word (e.g. "he" or "his") and $v_f$ a female-specific word (e.g. "she" or "her"). First, the gender bias per word is calculated using equation (2):

$$b(v_c) = \frac{1}{|M|} \sum_{v_m \in M} \cos(v_c, v_m) - \frac{1}{|F|} \sum_{v_f \in F} \cos(v_c, v_f) \tag{2}$$

Here, a negative value indicates the category word is female biased and a positive value indicates a male bias. This score is averaged over all words in the category $C$ to get the bias score $b(C)$:

$$b(C) = \frac{1}{|C|} \sum_{v_c \in C} b(v_c) \tag{3}$$

We chose to use WEAT since it is a popular way to measure bias in word embeddings and it allows us to compare our results to those of Jones et al. [2020]. This test will show whether these words contain differences in association with male and female, but how these differences relate to negative consequences in different applications is not precisely known. The results should be interpreted in this general sense: they show the existence of bias, but not how problematic the gender bias is.
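The computation described above can be illustrated with a minimal sketch. It assumes that the per-word bias is the mean cosine similarity to the male words minus the mean cosine similarity to the female words, which matches the stated sign convention (negative means female biased). The function names and the toy two-dimensional vectors are hypothetical and are not taken from the released code or from any trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def word_bias(v_c, male_vecs, female_vecs):
    """Mean similarity to male words minus mean similarity to female words."""
    male = sum(cosine(v_c, v_m) for v_m in male_vecs) / len(male_vecs)
    female = sum(cosine(v_c, v_f) for v_f in female_vecs) / len(female_vecs)
    return male - female

def category_bias(category_vecs, male_vecs, female_vecs):
    """Average the per-word bias over all words in a category."""
    scores = [word_bias(v, male_vecs, female_vecs) for v in category_vecs]
    return sum(scores) / len(scores)

# Toy 2-d vectors (hypothetical, for illustration only).
male_vecs = [[1.0, 0.0]]
female_vecs = [[0.0, 1.0]]
category_vecs = [[0.0, 2.0], [1.0, 1.0]]
toy_score = category_bias(category_vecs, male_vecs, female_vecs)
```

In this toy case the first category word points in the same direction as the female vector and the second lies exactly between both target vectors, so the category score comes out negative, i.e. female biased.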

Experimental Setup
All code and the models used for the experiments are made publicly available 1 .
Data and preprocessing. We obtained full copies of all articles on Wikipedia in 2006, 2008 to 2010 and 2014 to 2020 from dumps.wikimedia.org and archive.org. To make a comparison between full Wikipedia backups and newly added articles, we created a second corpus by taking all articles for which the ID was not present on Wikipedia two years before. For example, to create a corpus for 2020, we removed all articles that were added before 2019. All articles were converted to tokens using the built-in functionality from the gensim library [Řehůřek and Sojka, 2010]. This tool removes all articles shorter than 50 words, as well as all markup, comments and punctuation.
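The selection of newly added articles can be sketched as a simple filter on article IDs. This is only an illustration under the assumption that each snapshot is available as a mapping from article ID to text; the function name and the miniature snapshots are hypothetical.

```python
def select_new_articles(current, earlier_ids):
    """Keep only the articles whose ID is absent from the earlier snapshot."""
    return {aid: text for aid, text in current.items() if aid not in earlier_ids}

# Hypothetical miniature snapshots: article ID -> article text.
snapshot_earlier = {1: "old article", 2: "another old article"}
snapshot_current = {
    1: "old article, edited since",
    2: "another old article",
    3: "newly added article",
}
new_corpus = select_new_articles(snapshot_current, set(snapshot_earlier))
```

Note that an article that was edited but kept its ID is treated as old, so this filter captures new pages only and not new text in existing pages.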
Training of word embeddings. The word2vec model was used to train word embeddings [Mikolov et al., 2013a]. This model uses Continuous-bag-of-words to obtain word vectors that represent the word semantics as well as possible [Mikolov et al., 2013a]. Vectors that are closer together in the vector space represent words that co-occur more often. We mostly used the default settings for word2vec as provided by gensim [Řehůřek and Sojka, 2010]. However, we did not remove the 5% most common words, because this would also remove the words 'he' and 'she'. To ensure that the training had sufficiently converged, we calculated the bias after training for one, ten and twenty iterations (epochs), in addition to the default of five.
Quality of embeddings. We used the WordSim353 benchmark to assess the quality of word embeddings [Finkelstein et al., 2001]. This evaluation looks at the similarity of 353 word pairs and evaluates the correlation between the results of the embeddings and the true similarity as defined by humans. We used this as a sanity check to assess whether the word embeddings reasonably embed true word semantics. These correlation scores can be found in Table 8 in the Appendix; they are all between .63 and .66. This is comparable to the correlations between .60 and .67 that were found using word2vec by Jatnika et al. [2019], which is already better than the model trained by Google they used as comparison [Mikolov et al., 2013b]. As may be expected with a smaller corpus, the scores for the data set of new articles are slightly lower (between .59 and .64), but still reasonable.
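As a rough illustration of such a check, the sketch below correlates model similarities with human ratings. It assumes a Spearman rank correlation, which is a common choice for this benchmark but is not stated in the text; the ratings and similarities are hypothetical toy values, not WordSim353 data.

```python
import math

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical human ratings and model cosine similarities for four word pairs.
human_ratings = [9.0, 7.5, 3.0, 1.0]
model_similarities = [0.8, 0.6, 0.7, 0.1]
quality = spearman(human_ratings, model_similarities)
```

Because only the ordering of the pairs matters for a rank correlation, the absolute scale of the cosine similarities does not need to match the scale of the human ratings.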
Significance of change in WEAT score. We performed a linear regression on the WEAT score versus time. We measured whether the change in WEAT score is significant by performing a t-test to compute whether the slope is significantly different from zero. To reduce the number of false discoveries from multiple testing, we used a Benjamini-Hochberg correction with a False Discovery Rate (FDR) of 5% [Benjamini and Hochberg, 1995].
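The two steps of this procedure can be sketched as follows. The first function computes the least-squares slope and its t statistic; converting that statistic into a p-value requires the Student t distribution with n - 2 degrees of freedom, which is left to a statistics library here. The second function applies the Benjamini-Hochberg step-up rule to a list of p-values. Function names and all numbers are hypothetical and serve only as an illustration.

```python
import math

def slope_t_statistic(years, scores):
    """Least-squares slope of score versus year and its t statistic."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, scores))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / std_err

def benjamini_hochberg(p_values, fdr=0.05):
    """Return one boolean per p-value: True where the null is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0  # largest rank k whose sorted p-value is at most k / m * fdr
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected

# Hypothetical WEAT scores for four consecutive years.
slope, t_stat = slope_t_statistic([2006, 2007, 2008, 2009], [0.0, 0.1, 0.3, 0.4])
# Hypothetical p-values for four word categories.
decisions = benjamini_hochberg([0.01, 0.02, 0.03, 0.20])
```

The step-up rule rejects every hypothesis up to the largest rank that passes its threshold, so a p-value can be rejected even if it would fail its own threshold when considered in isolation.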
Significance against random words. A significant change in WEAT scores may not tell the whole story. It could be the case that, for some reason, all word vectors in the vocabulary become more similar to male or female words. To exclude this possibility, we also computed WEAT scores of random words, using a method proposed in the code from Jones et al. [2020]. We performed a regression on these WEAT scores for many different groups of random words to obtain a histogram of slopes. This histogram of slopes indicates the distribution of slopes for random words. We can then inspect how likely it is for a word category (such as Arts) to have the observed slope, and to see whether the slope is significantly different from slopes of random words. To this end, we used a sample of 1000 random word sets and counted how many of these slopes are at least as extreme as the observed one to determine a permutation p-value for the category word set. On these p-values we did another Benjamini-Hochberg correction with the same FDR of 5%.
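The counting step can be sketched as below. The sketch assumes that "at least as extreme" is judged two-sidedly on the absolute value of the slope; the text does not specify this, and an alternative would be to measure extremeness relative to the mean random slope. The function name and slopes are hypothetical.

```python
def permutation_p_value(observed_slope, random_slopes):
    """Fraction of random-set slopes at least as extreme as the observed slope."""
    extreme = sum(1 for s in random_slopes if abs(s) >= abs(observed_slope))
    return extreme / len(random_slopes)

# Hypothetical slopes for eight random word sets (the study uses 1000).
random_slopes = [-0.002, -0.001, -0.0005, 0.0, 0.0005, 0.001, 0.002, 0.003]
p_value = permutation_p_value(-0.0015, random_slopes)
```

With 1000 random word sets, the smallest non-zero p-value this procedure can produce is 0.001, which bounds the resolution of the test.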
Deviation of gender bias within a category. The WEAT score used to quantify the gender bias is a mean over several words in a category. It could be the case that one of the words of a word category influences the mean more than others (e.g. as an outlier). This could indicate that a word in a word category is inappropriate, which would point to a problem with the WEAT test. Alternatively, it can indicate where Wikipedia editors should focus their efforts on changing the language in the articles to reduce the measured gender bias. To investigate this, we also compute the deviation from the means of the different categories for 2008, 2014 and 2020. This will show if there are categories with words with large deviations. In case of large deviations, we look at the individual word scores to investigate which words have the largest influence on the bias.
Number of articles per category. A further explanation of why gender bias has changed over time could be provided by looking at the categories of the articles on Wikipedia. We therefore counted the number of articles which contained at least one of the words of the word categories for these three available time points.
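Such a count can be sketched with a set intersection per article. The sketch assumes that articles are available as lists of tokens; the function name, the tiny corpus and the word list (a hypothetical subset of an Arts-like category) are for illustration only.

```python
def fraction_with_any(articles, words):
    """Fraction of tokenized articles containing at least one of the words."""
    word_set = set(words)
    hits = sum(1 for tokens in articles if word_set.intersection(tokens))
    return hits / len(articles)

# Hypothetical tokenized articles.
articles = [
    ["she", "painted", "modern", "art"],
    ["he", "studied", "physics"],
    ["the", "city", "council", "met"],
    ["poetry", "and", "dance", "festival"],
]
arts_words = ["art", "poetry", "dance"]  # hypothetical subset of a category
arts_fraction = fraction_with_any(articles, arts_words)
```

An article that contains several category words is still counted once, so the result is a proportion of articles rather than a word frequency.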

Results
Gender bias scores over time. The gender biases for Wikipedia over time are shown in Figure 2a for the different word categories. The box plots indicate the distribution of WEAT scores for random words, which changes little over time and whose mean seems close to zero, indicating that random words are almost unbiased on average. Career, Arts and Family seem to have strong biases since they fall outside the box plots, while biases in Science seem milder, as its WEAT score is comparable to those of random sets of words. Table 1 lists the p-values for whether a slope is significantly different from zero, corrected using the Benjamini-Hochberg method. Career has a strong association with male words that has not significantly changed over time. The category Science had a male bias in 2006, but this bias slowly changed over time, and the category is currently associated slightly more strongly with female words. This could be because the words in this category have been used in the same context as female words as opposed to male words more often since 2014. The words in the Family category have a significantly decreasing female bias, but in 2020 they are still strongly associated with female words. The Arts category is stereotypically female-associated and these words are becoming more biased towards female words, with a statistically significant slope.
WEAT scores of random words. The histograms of the slopes found from random word sets are given in Figure 3. The mean slope is $4.8 \cdot 10^{-5}$, with a standard deviation of $6.3 \cdot 10^{-4}$. We conclude that the whole vocabulary of Wikipedia has on average not become a lot more male or female biased over time. This is confirmed by the fact that the box plots in Figure 2a do not shift over time.
The slope for random words has a larger variance when looking at only the new articles. Random word sets have a mean slope of $2.3 \cdot 10^{-4}$ with a standard deviation of $1.0 \cdot 10^{-3}$ in the word embeddings from recent articles. This shows that the larger slopes seen in the category words for recent articles might be partly caused by larger changes seen in all word embeddings (see Figure 3b). Results of new articles are therefore less reliable, also due to a smaller corpus and fewer time points. The p-values can be found in Table 2. Arts (.024) is the only category where the change is also significant compared to changes in random words for the complete Wikipedia corpus. All categories change significantly when considering only newly-added articles. The lower significance in comparison to random words means that despite the existence of slopes significantly different from 0, there may still be reason to doubt the effectiveness of the effort from Wikipedia. It also calls into question whether changes in bias in Table 1 were really significant.
Effect of number of word2vec iterations. We ran the training procedure of the word embeddings and computed the bias for each word category for one, five, ten and twenty iterations. The results are given in Table 3. Between one and five iterations the gender bias slope changes quite a bit. For example, the slope of Science changes from about $-3.1 \cdot 10^{-3}$ to $-1.1 \cdot 10^{-3}$ and the p-value of Arts varies between 0.05 and 0.01. However, most differences between five and ten iterations are smaller, including the slope values for Arts.

[Figure: histogram of regression slopes. Horizontal axis: slope (×10^−3), ranging from −5 to 5; negative slopes are labeled "Becoming more female" and positive slopes "Becoming more male".]
The quality of the word embeddings also changed little after 5 iterations (see Table 4). This validates our choice of using the default value of 5 iterations. To further investigate whether the slope and p-values had converged, we also tried 20 iterations. The resulting word embeddings had significantly lower quality scores (0.57 on average), with models trained on the most data (in more recent years) achieving scores as low as 0.52. We believe that this might be due to overtraining and therefore chose not to use these embeddings for measuring bias. We note that the number of iterations can influence the measured biases and should be varied to make certain that the values have converged while the models have not become overfitted.
Deviation within a word category. The means and standard deviations for the categories at three time points are given in Table 5. The Family category has a higher variance than the other categories. To understand why, we looked at the bias of each word in this category in 2020, see Table 6. The words 'wedding', 'marriage' and 'children' have a very strong female bias, whereas 'home', 'cousins' and 'family' are only slightly more female associated.
Number of articles per category. The percentage of articles which contained at least one of the words of the sets is given in Figure 4. Observe that the proportions have changed little over time, so this does not provide an explanation for the changes in bias over time. All periods thus have a similar contribution to the category bias. Male words are present in more of the articles than female words.

Discussion and Future Work
Since societal gender bias is decreasing [Garg et al., 2018], we expected that using text written more recently would result in less gender biased word embeddings. We have shown that stereotypical gender bias in the categories Family and Science is indeed decreasing, but these changes are not significant in comparison to random word sets. Words related to Career did not seem to change since 2006. Bias in Arts has significantly increased, also in comparison to random words. Further research, maybe on a longer time period, is necessary to conclude what causes these changes and how significant the changes are.
The vast majority of biographies in Wikipedia are about men [Envel Le Hir, 2017-2020]. This discrepancy has decreased a little since 2017. This is confirmed by the fact that a lot more articles contain words from our male set than from our female set. However, we do not observe that random words are more associated with male words. This could also be seen in the fact that Science words are more female associated in 2020, despite less than 15% of the scientists with biographies being female. A possible reason for this is that articles about women contain more gender-specific words [Wagner et al., 2016], for example: 'female scientist'. The expected gender goes without saying, whereas the minority gender is explicitly specified [Pratto et al., 2007]. This causes words to become more female-associated than expected from the ratio of biographies. Wikipedia may inform its contributors about this skew in female biographies in the hope that this bias will be reduced.
To reduce gender bias in Family further, our results suggest that a focus on equal representation in the topics of marriage and children would be most beneficial. It is unclear why the Arts category is becoming more and more female biased.
When word embeddings are used in downstream tasks such as classification, our research shows it is important to consider the time of retrieval of a corpus. For example, if one wants to have a gender neutral word embedding related to Science, one may best use the corpus of 2018. Such effects may also occur in other corpora. More research is needed to further understand the quality of word embeddings as measured by performance in downstream tasks and unwanted biases in such tasks.
New articles are not gender neutral either. They show similar developments, but these are stronger and also significant in comparison to random words. We could not completely determine if new articles are the cause for changes in gender bias, since we did not consider changes in existing articles. Few statistics are available relating to gender bias of Wikipedia. This makes it difficult to place our results in a wider context. Since our work indicates biases are currently increasing further for some categories, current strategies to reduce bias may need to be changed. To further improve the editing strategies of Wikipedia, more automated measures of biases may provide necessary insights.
Compared to the historical embeddings from the study of Jones et al. [2020], we find several differences but also agreements. In contrast to their findings, we find that Art related words are becoming more biased towards female. The bias of Family is decreasing in their study as well; however, they find less steep slopes. The decrease they found in the Career category was not found as clearly in our results, which may also be due to the shorter time span. It is hard to say where the differences stem from: perhaps they are due to different societal changes or to a different platform.
One limitation of this research is the fact that no backups of Wikipedia were available between 2010 and 2014. Moreover, we did not look at what text was written exactly when. This information could provide more insight into the developments of gender bias. The current version of Wikipedia still contains text written in 2001, and thus biases in the full corpus of Wikipedia may not represent the development of societal biases precisely. The analysis on only new articles may give a better estimate in that respect. However, due to the unreliability of using page IDs, this still does not give a perfect representation.
The WEAT-score is not a perfect measure of gender bias of its underlying content. One of the problems is interpretability: where do the biases come from? To that end, Wikipedia's content should also be looked at in more detail. We tried to make this connection using word counts over all Wikipedia pages, but a more elaborate analysis is necessary to complement our analysis. Another option is to use the technique of Brunet et al. [2019] to find the most bias influencing articles. This will give further clues how to make Wikipedia more gender neutral.
Hamilton et al. [2016] discovered laws of semantic shift by looking at word embeddings over large time spans. These laws could explain some of our observed changes in gender bias. The most relevant law is the law of conformity: frequent words change embedding location more slowly. This might be taken to imply that the Arts category, whose words are most used on Wikipedia (see Figure 4), would change bias the least. However, the opposite is the case, as Arts has one of the steepest slopes. Sadly, we cannot compare our rates of change to those found by Hamilton et al., since we cannot find the raw rates of change per year in their work. These rates could be used to place changes of WEAT-scores over time in context. We note, however, that the slopes of the categories are already (crudely) placed in context when they are compared against the slopes of random words. Here a further correction could be made with word frequencies to take the law of conformity into account. On the other hand, since our work focuses on a much shorter time scale, we can assume that such changes are negligible, especially for the WEAT words which are generally frequently used and therefore less likely to have major changes in meaning within 20 years.
Word embeddings were shown to be surprisingly unstable across restarts with different random initialisations [Wendlandt et al., 2018]. In that work, stability was defined as the fraction of the 10 nearest neighbours of each word that are the same before and after the restart. Thus, this is a measure of local stability. The WEAT score is determined, however, over larger distances of word embeddings. Thus, local instability does not directly imply that WEAT scores would also be unstable. To mitigate this potential instability, we initialized each model with the same seed. While a more elaborate investigation of the stability of WEAT to multiple random restarts is out of the scope of this work, we think it is an important point to investigate in order to verify that our results and those of Jones et al. [2020] and Garg et al. [2018] are robust.
We considered the four default word sets as provided by the WEAT test, to allow comparison to Jones et al. [2020]. Remarkably, these word sets include two male names: Einstein and Shakespeare. Einstein is on average about 0.04 above the category mean of Science, and Shakespeare approximately 0.03 above the mean of Arts, influencing the category means positively, making them more male-biased. It is expected that the names Einstein and Shakespeare co-occur more with male words such as 'he' or 'him'. However, this may not be representative of the rest of Science or Arts words in general, and thus may overestimate male bias in these subjects. We realize that Einstein and Shakespeare were and still are very influential in the fields of science and arts respectively. However, if our goal is that articles about more important individuals (which might be read by more people) have higher impact on the bias calculation we could weigh articles based on notability [Wagner et al., 2016] at the embedding learning stage. To further understand the (perhaps unwanted) effects of using these two words, we believe that more research in the choice of words of WEAT is necessary.

Conclusion
In this paper, we used word embeddings to estimate changes in gender bias in Wikipedia articles over time. We found evidence that gender bias is decreasing for Science and Family, while increasing for Arts. Biases in the male-associated category Career seem constant. Further analysis of these results provides insights that can potentially lead to new practices to reduce gender bias in Wikipedia even more in the future.