Quantifying 60 Years of Gender Bias in Biomedical Research with Word Embeddings

Gender bias in biomedical research can have an adverse impact on the health of real people. For example, there is evidence that heart disease-related funded research generally focuses on men. Health disparities can form between men and at-risk groups of women (i.e., elderly and low-income) if there is not an equal number of heart disease-related studies for both genders. In this paper, we study temporal bias in biomedical research articles by measuring gender differences in word embeddings. Specifically, we address multiple questions, including, How has gender bias changed over time in biomedical research, and what health-related concepts are the most biased? Overall, we find that traditional gender stereotypes have reduced over time. However, we also find that the embeddings of many medical conditions are as biased today as they were 60 years ago (e.g., concepts related to drug addiction and body dysmorphia).


Introduction
It is important to develop gender-specific best-practice guidelines for biomedical research (Holdcroft, 2007). If research is heavily biased towards one gender, then the resulting guidance may contribute to health disparities because the evidence it draws on may be questionable (i.e., not well studied). For example, there is more research funding for the study of heart disease in men (Weisz et al., 2004). As a result, the at-risk populations of older women in low economic classes are not as well investigated, which opens up the possibility of an increase in the health disparities between genders.
Among informatics researchers, there has been increased interest in understanding, measuring, and overcoming bias associated with machine learning methods. Researchers have studied many application areas to understand the effect of bias. For example, Kay et al. (2015) found that the Google image search application is biased. Specifically, they found an unequal representation of gender stereotypes in image search results for different occupations (e.g., all police images are of men). Likewise, ad-targeting algorithms may include characteristics of sexism and racism (Datta et al., 2015; Sweeney, 2013). Sweeney (2013) found that the names of black men and women are likely to generate ads related to arrest records. In healthcare, much of the prior work has studied bias in the diagnosis process carried out by doctors (Young et al., 1996; Hartung and Widiger, 1998). There have also been studies of the ethical considerations surrounding the use of machine learning in healthcare (Cohen et al., 2014).
It is possible to analyze and measure the presence of gender bias in text. Garg et al. (2018) analyzed the presence of well-known gender stereotypes over the last 100 years. Hamberg (2008) showed that gender blindness and stereotyped preconceptions are the key causes of gender bias in medicine. Heath et al. (2019) studied the gender-based linguistic differences in physician trainee evaluations of medical faculty. Salles et al. (2019) measured the implicit and explicit gender bias among health care professionals and surgeons. Feldman et al. (2019) quantified the exclusion of females in clinical studies at scale with automated data extraction. Recently, researchers have studied methods to quantify gender bias using word embeddings trained on biomedical research articles (Kurita et al., 2019). Kurita et al. (2019) showed that the resulting embeddings capture some well-known gender stereotypes. Moreover, the embeddings exhibit the stereotypes at a lower rate than embeddings trained on other corpora (e.g., Wikipedia). However, to the best of our knowledge, there has not been an automated temporal study of the change in gender bias.
In this paper, we look at the temporal change of gender bias in biomedical research. To study social biases, we make use of word embeddings trained on different decades of biomedical research articles. The two main questions driving this work are, In what ways has bias changed over time, and Are there certain illnesses associated with a specific gender? We leverage three computational techniques to answer these questions: the Word Embedding Association Test (WEAT) (Caliskan et al., 2017), the Embedding Coherence Test (ECT) (Dev and Phillips, 2019), and Relational Inner Product Association (RIPA) (Ethayarajh et al., 2019). To the best of our knowledge, this is the first temporal analysis of bias in word embeddings trained on biomedical research articles. Moreover, to the best of our knowledge, this is the first analysis that measures the gender bias associated with individual biomedical words.
Our work is most similar to Garg et al. (2018). Garg et al. (2018) study the temporal change of both gender and racial biases using word embeddings. Our work substantially differs in three ways. First, this paper is focused on biomedical literature, not general text corpora. Second, we analyze gender stereotypes using three distinct methods to see if the bias is robust to various measurement techniques. Third, we extend the study beyond gender stereotypes. Specifically, we look at bias in sets of occupation words, as well as bias in mental health-related word sets. Moreover, we quantify the bias of individual occupational and mental health-related words.
In summary, the paper makes the following contributions:
• We answer the question: How has the usage of gender stereotypes changed in the last 60 years of biomedical research? Specifically, we look at the change in well-known gender stereotypes (e.g., Math vs Arts, Career vs Family, Intelligence vs Appearance, and occupations) in biomedical literature from 1960 to 2020.
• We answer the question: What are the most gender-stereotyped words for each decade during the last 60 years, and have they changed over time? This contribution is more focused than simply looking at traditional gender stereotypes. Specifically, we analyze two groups of words: occupations and mental health disorders. For each group, we measure the overall change in bias over time. Moreover, we measure the individual bias associated with each occupation and mental health disorder.

Related Work
In this section, we discuss research related to the three major themes of this paper: gender disparities in healthcare, biomedical word embeddings, and bias in natural language processing (NLP).

Gender Disparities in Healthcare.
There is evidence of gender disparities in the healthcare system, from the diagnosis of mental health disorders to differences in substance abuse. An important question is, Do similar biases appear in biomedical research? In this work, while we explore traditional gender stereotypes (e.g., Intelligence vs Appearance), we also measure potential bias in the occupations and mental health-related disorders associated with each gender. With regard to mental health, major depression is one of the most common mental health illnesses, affecting more than 17 million adults in the United States (US) alone (Pratt and Brody, 2014). Depression can cause people to lose pleasure in daily life, complicate other medical conditions, and possibly lead to suicide (Pratt and Brody, 2014). Moreover, depression can affect anyone, at any age, and of any race or ethnic group. While treatment can help individuals suffering from major depression, or mental illness in general, only about 35% of individuals suffering from severe depression seek treatment from mental health professionals. It is common for people to resist treatment because of the belief that depression is not serious, that they can treat themselves, or that it would be seen as a personal weakness rather than a serious medical illness (Gulliver et al., 2010). Although depression can affect anyone, women are almost twice as likely as men to have had depression (Albert, 2015). Moreover, the prevalence of depression is generally higher among certain demographic groups, including, but not limited to, Hispanic, non-Hispanic black, low income, and low education groups (Bailey et al., 2019). The focus of this paper is to understand the impact of these mental health disparities on word embeddings trained on biomedical corpora.

Biomedical Word Embeddings.
Word embeddings capture the distributional nature of words (i.e., words that appear in similar contexts will have a similar vector encoding). Over the years, there have been multiple methods of producing word embeddings, including, but not limited to, latent semantic analysis (Deerwester et al., 1990), Word2Vec (Mikolov et al., 2013a,b), and GloVe (Pennington et al., 2014). Moreover, pretrained word embeddings have been shown to be useful for a wide variety of downstream biomedical NLP tasks, such as text classification (Rios and Kavuluru, 2015), named entity recognition (Habibi et al., 2017), and relation extraction (He et al., 2019). Chiu et al. (2016) study a standard methodology for training good biomedical word embeddings. Essentially, they study the impact of the various Word2Vec-specific hyperparameters. In this paper, we use the strategies proposed in Chiu et al. (2016) to train optimal decade-specific biomedical word embeddings.

Bias and Natural Language Processing.
Unfortunately, because word embeddings are learned using naturally occurring data, implicit biases expressed in text will be transferred to the vectors. Bias (and fairness) is an important topic among natural language processing researchers. Bias has been found in word embeddings (Bolukbasi et al., 2016; Zhao et al., 2018, 2019), text classification models (Dixon et al., 2018; Park et al., 2018; Badjatiya et al., 2019; Rios, 2020), and machine translation systems (Font and Costa-jussà, 2019; Escudé Font, 2019). In general, each paper focuses either on testing whether bias exists in various models or on removing bias from classification models for specific applications.
Much of the work on measuring (gender) bias using word embeddings neither studies the temporal aspect (i.e., how bias changes over time) nor focuses on biomedical research (Chaloner and Maldonado, 2019). Garg et al. (2018) developed a technique to study 100 years of gender and racial bias using word embeddings. They evaluated the bias over time using the US Census as a baseline to compare embedding bias to demographic and occupation shifts. There has also been work on measuring bias in sentence embeddings (May et al., 2019). Furthermore, there has been a significant amount of research that explores different ways to measure bias in word embeddings (Caliskan et al., 2017; Dev and Phillips, 2019; Ethayarajh et al., 2019). In this work, we apply several of these bias measurement techniques (Caliskan et al., 2017; Dev and Phillips, 2019; Ethayarajh et al., 2019) to the biomedical domain.

Table 1: Total number of articles per decade.
| Year | # Articles |
|---|---|
| 1960-1969 | 1,479,370 |
| 1970-1979 | 2,305,257 |
| 1980-1989 | 3,322,556 |
| 1990-1999 | 4,109,739 |
| 2000-2010 | 6,134,431 |
| 2010-2020 | |
| Total | 26,037,973 |

Dataset
We analyze PubMed-indexed titles and abstracts published between 1960 and 2020. The total number of articles per decade is shown in Table 1. The text is lower-cased and tokenized using the SimpleTokenizer available in GenSim (Khosrovian et al., 2008). We find that the total number of papers has grown substantially each decade, from 1.4 million indexed articles in the 1960s to 8.6 million in the 2010s. Yet, the rate of growth stayed relatively stable each decade.

Method
We train the Skip-Gram model on PubMed-indexed titles and abstracts from 1960 to 2020. The hyperparameters of the Skip-Gram model are optimized independently for each decade. Next, given the best set of embeddings for each decade, we explore three different techniques to measure bias: the Word Embedding Association Test (WEAT), the Embedding Coherence Test (ECT), and the Relational Inner Product Association (RIPA). Each method allows us to quantify bias in different ways, such as comparing multiple sets of words (e.g., comparing the bias with respect to Career vs Family), comparing a single set of words (e.g., occupations), and measuring the bias of individual words (e.g., nurse). In this section, we briefly discuss the procedure we used to train the word embeddings,

Attribute Words
Male vs Female
X: male, man, boy, brother, he, him, his, son, father, uncle, grandfather
Y: female, woman, girl, sister, she, her, hers, daughter, mother, aunt, grandmother

To find the best model as we search over the various hyperparameters, we make use of the UMLS-Sim dataset (McInnes et al., 2009). UMLS-Sim consists of 566 medical concept pairs for measuring similarity. The degree of association between terms in UMLS-Sim was rated by four medical residents from the University of Minnesota medical school. All these clinical terms correspond to Unified Medical Language System (UMLS) concepts included in the Metathesaurus (Bodenreider, 2004). Evaluation is performed using Spearman's rho rank correlation between the vector of cosine similarities for the 566 pairs of words and their respective medical-resident ratings. Intuitively, the ranking of the pairs using cosine similarity, from most similar to least similar, should be close to the ranking given by the human (medical expert) annotations.
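The evaluation described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' code: the function names, the toy vectors, and the toy ratings are hypothetical, and skipping pairs that contain an out-of-vocabulary term is an assumption, since the text does not say how missing terms are handled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def similarity_quality(embeddings, rated_pairs):
    """rated_pairs holds (term1, term2, human_rating) triples.

    Assumption: pairs with a term missing from the vocabulary are skipped.
    Returns (rho, number_of_pairs_used).
    """
    model_scores, human_scores = [], []
    for t1, t2, rating in rated_pairs:
        if t1 in embeddings and t2 in embeddings:
            model_scores.append(cosine(embeddings[t1], embeddings[t2]))
            human_scores.append(rating)
    return spearman(model_scores, human_scores), len(model_scores)

# Toy data, made up for illustration only.
toy_embeddings = {
    "heart": [1.0, 0.0],
    "cardiac": [0.9, 0.1],
    "renal": [0.5, 0.5],
    "lung": [0.0, 1.0],
}
toy_pairs = [
    ("heart", "cardiac", 4.0),
    ("heart", "renal", 2.0),
    ("heart", "lung", 1.0),
    ("heart", "unseen", 3.0),  # skipped: "unseen" has no vector
]
rho, used = similarity_quality(toy_embeddings, toy_pairs)
```

In a hyperparameter search, one would compute this score for each candidate model of a decade and keep the model with the highest rho.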

Word Embedding Association Test
The implicit bias test measures unconscious prejudice (Greenwald et al., 1998). WEAT is a generalization of the implicit bias test for word embeddings, measuring the association between two sets of target concepts and two sets of attributes. We use the same target and attribute sets as Kurita et al. (2019). We list the targets and attributes in Table 2. The attribute sets of words are related to the groups that the embeddings are biased towards or against, e.g., Male vs Female. The words in the target categories (Career vs Family, Math vs Arts, Science vs Arts, Intelligence vs Appearance, and Strength vs Weakness) represent the specific types of biases. For example, using the attributes and targets, we want to know whether the learned embeddings of the male-related words are more related to career than those of the female-related words (i.e., test if female words are more related to family than male words).
Formally, let X and Y be equal-sized sets of attribute (gendered) word embeddings (e.g., Male and Female) and let A and B be the two contrasting sets of target concept embeddings (e.g., Career and Family). To measure the bias, we follow Caliskan et al. (2017), which defines a test statistic as the difference between the summed association scores of the words in X and the words in Y,

$$s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B),$$

where s(w, A, B) measures the association of a single word w (e.g., man) with the two sets A and B as

$$s(w, A, B) = \frac{1}{|A|} \sum_{a \in A} \cos(w, a) - \frac{1}{|B|} \sum_{b \in B} \cos(w, b),$$

such that cos() represents the cosine similarity between two vectors. Here w ∈ R^d, a ∈ R^d, and b ∈ R^d are the word embeddings of w, a, and b, respectively, and d is the dimension of each word embedding. Instead of using the test statistic directly, to measure bias, we use the effect size. Effect size is a normalized measure of the separation of the two distributions, defined as

$$\mathrm{ES}(X, Y, A, B) = \frac{\mu_{x \in X}\, s(x, A, B) - \mu_{y \in Y}\, s(y, A, B)}{\sigma_{w \in X \cup Y}\, s(w, A, B)},$$

where µ_{x∈X} and µ_{y∈Y} denote the mean association scores over the words in X and Y, respectively. Likewise, σ_{w∈X∪Y} is the standard deviation of the scores s(w, A, B) over all words w in the union of X and Y. Intuitively, a positive score means that the attribute words in X (e.g., male, man, boy) are more similar to the target words A (e.g., strong, power, dominant) than the words in Y (e.g., female, woman, girl) are. Moreover, larger effects represent more biased embeddings. As previously stated, the attribute and target words are from Kurita et al. (2019). It is important to note that the list is manually curated. Moreover, the bias measurement can change depending on the exact list of words. RIPA is more robust to slight changes to the attribute words than WEAT (Ethayarajh et al., 2019).
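The effect size can be illustrated with a short standard-library sketch. It assumes that X and Y hold the gendered word vectors and A and B the two contrasting concept sets, as in the intuition given above; the formula itself does not depend on which pair is called "attribute" or "target". The function names and toy vectors are hypothetical, and the use of the sample standard deviation is an assumption, since implementations differ on this point.

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(w, A, B):
    """s(w, A, B): mean cosine of w with set A minus mean cosine with set B."""
    return mean([cosine(w, a) for a in A]) - mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Difference of mean association scores, divided by the pooled spread."""
    x_scores = [association(x, A, B) for x in X]
    y_scores = [association(y, A, B) for y in Y]
    # Assumption: sample standard deviation over the scores of X union Y.
    return (mean(x_scores) - mean(y_scores)) / stdev(x_scores + y_scores)

# Toy 2-d vectors, made up for illustration (not real embeddings).
X = [[1.0, 0.0], [0.9, 0.1]]   # stands in for male words
Y = [[0.0, 1.0], [0.1, 0.9]]   # stands in for female words
A = [[1.0, 0.2]]               # stands in for, e.g., career words
B = [[0.2, 1.0]]               # stands in for, e.g., family words

effect = weat_effect_size(X, Y, A, B)
print(round(effect, 3))
```

A positive value here says that the X vectors sit closer to A than the Y vectors do; swapping X and Y flips the sign.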

Embedding Coherence Test.
We also explore a second method of measuring bias, the Embedding Coherence Test (ECT) (Dev and Phillips, 2019). Unlike WEAT, it compares the attribute words (e.g., Male vs Female) with a single target set (e.g., Career). Thus, we do not need two contrasting target sets (e.g., Career vs Family) to measure bias. We take advantage of this to measure bias associated with occupations and mental health-related disorders. Specifically, we use a total of 290 occupation words and 222 mental health-related words. The occupation words come from prior work measuring per-word bias (Dev and Phillips, 2019). To form a list of mental health words, we use the Diagnostic and Statistical Manual of Mental Disorders (DSM-5), a taxonomic and [...] ([...], 2013). Each mental health disorder in DSM-5 is generally a multi-word expression, so we split it into individual words. Next, we manually remove uninformative adjectives and function words. For example, the disorder "Specific learning disorder, with impairment in mathematics" is tokenized into the following words: "learning", "disorder", "impairment", and "mathematics". A complete listing of the occupational and mental health words can be found in the appendix.

Table 3
| Year | Sim | |
|---|---|---|
| 1960-1969 | .6586 | 101 |
| 1970-1979 | .6715 | 207 |
| 1980-1989 | .7033 | 277 |
| 1990-1999 | .7282 | 265 |
| 2000-2010 | .7078 | 272 |
| 2010-2020 | .6867 | 306 |

Formally, ECT first computes the mean vectors for the attribute word sets X and Y, defined as

$$v_X = \frac{1}{|X|} \sum_{x \in X} x,$$

where v_X ∈ R^d and |X| represents the number of words in category X. v_Y is calculated similarly. For both v_X and v_Y, ECT computes the (cosine) similarities with all vectors a ∈ A, i.e., the cosine similarity is calculated between each target word a and v_X and stored in s_X ∈ R^{|A|}. The two resultant vectors of similarity scores, s_X (for X) and s_Y (for Y), are used to obtain the final ECT score, which is the Spearman's rank correlation between the rank orders of s_X and s_Y: the higher the correlation, the lower the bias. Intuitively, if the correlation is high, then the ranking of the target words by similarity is nearly the same whether it is calculated for X or for Y (i.e., male or female).
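The ECT procedure described above (mean vectors, cosine similarities to each target word, then a rank correlation) can be sketched as follows. This is a minimal standard-library illustration under stated assumptions: the function names and the toy vectors are hypothetical, and ties in the similarity scores are handled with average ranks.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

def ect_score(X, Y, A):
    """Rank correlation between the similarity profiles of the two groups."""
    v_x, v_y = mean_vector(X), mean_vector(Y)
    s_x = [cosine(a, v_x) for a in A]   # similarity of each target to mean of X
    s_y = [cosine(a, v_y) for a in A]   # similarity of each target to mean of Y
    return spearman(s_x, s_y)

# Toy 2-d vectors, made up for illustration (not real embeddings).
X = [[1.0, 0.0], [0.8, 0.2]]               # stands in for male words
Y = [[0.0, 1.0], [0.2, 0.8]]               # stands in for female words
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # stands in for target words

same_group = ect_score(X, X, A)      # identical groups rank targets identically
opposite_groups = ect_score(X, Y, A)
```

In this toy setup the two groups rank the target words in exactly opposite order, which is the most biased case the score can express; identical groups give the least biased case.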

Results
In this section, we present the results of our study in four parts. First, we report the embedding quality using UMLS-sim. Second, we study the temporal bias of traditional gender stereotypes, such as Career vs Family and Strong vs Weak. Ideally, we want to understand how, and which, stereotypes have changed over time. To understand the biased stereotypes, we make use of the WEAT method. Third, we look at whether occupational and mental health-related words are biased, and how the bias has changed over time. For this result, we only use a single set of target words. Thus, we make use of ECT. Fourth, we use RIPA to find the most biased words for each gender in each decade.

Embedding Quality.
In Table 3, we report the quality of each decade's embeddings based on the UMLS-Sim dataset. Overall, we find that the quality consistently improves until the 1990s; however, we see drops in the 2000s and 2010s. We hypothesize that the decrease in embedding quality is caused by the growth of research articles indexed on PubMed. Intuitively, each word is represented by a single vector, so the embeddings are only able to capture a single sense of a word. However, given the breadth of articles PubMed indexes, from machine learning (e.g., BioNLP) to biomaterials, multiple word meanings are being stored in a single vector. Thus, the overall quality begins to drop.

Traditional Gender Stereotypes.
In Figure 1, we plot the bias scores reported using WEAT. Remember, a large positive score means that the male words are more similar to the targets A (e.g., career) than the female words. A value of zero indicates no measurable bias. Overall, we find that the results from the WEAT test vary depending on the stereotype. For Career vs Family, in Figure 1a, we find a steady linear decrease in bias each decade, with the exception of the 1990s. We also find similar linear decreases in bias for both Science vs Art and Strong vs Weak (Figures 1c and 1e). In Figure 1b, for Math vs Art, however, the bias stays relatively static, i.e., it does not dramatically change over time. Moreover, the WEAT score for Math vs Art is negative, meaning that the female words are more similar to math than the male words. Likewise, for Intelligence vs Appearance (Figure 1d), we see relatively little bias from 1960 to 1989; however, in the 1990s and 2000s, there is a substantial jump in the bias score.

[Figure 1: WEAT bias scores per decade, 1960-1969 through 2010-2020. Panels: (a) Career vs Family, (b) Math vs Art, (c) Science vs Art, (d) Intelligence vs Appearance, (e) Strong vs Weak.]
Our evaluation supports prior work evaluating bias in biomedical word embeddings (e.g., Strong vs Weak is the most biased stereotype in biomedical literature) (Chaloner and Maldonado, 2019, Table 2). However, we also find differences when measuring bias over time. For example, we find that from 2010 to 2019 there is not much evidence for the Career vs Family stereotype in biomedical corpora, matching the results from Chaloner and Maldonado (2019, Table 2). Yet, this is only a recent phenomenon. The embeddings trained on articles published from 1990 to 1999 exhibit a Career vs Family bias score greater than 1.5. Overall, compared to Chaloner and Maldonado (2019, Table 2), this means that the bias in recently published biomedical literature may not be as strong as what is found in general text corpora. However, if we exclude the most recent decade's embeddings, the bias in biomedical literature becomes much stronger. Future work should compare the temporal bias in general text corpora to what is found in biomedical literature.

Occupational and Mental Health Bias.
In Figure 2, we report the gender bias results from ECT on two categories: occupations (e.g., doctor, nurse, teacher) and mental health disorders (e.g., depression, alcoholism, PTSD). Again, unlike WEAT, ECT calculates bias scores on a single target set of words. Therefore, we do not need two contrasting target word sets (e.g., Math vs Art); instead, we can focus on bias for a single set (e.g., Math). Also, the larger the score, the lower the bias: a score of one would represent no difference between male and female words for that specific target set. Interestingly, we find that the ECT scores follow a similar pattern to that found in Table 3: the better the embedding quality, the lower the bias.
Comparing Figures 2a and 2b, we find that the word embeddings for both occupations and mental disorders have relatively little bias in the 1990s. Furthermore, while there was small variation, mental disorders experienced little change in bias decade-by-decade. Yet, occupation-related words had a substantial amount of bias in the 1960s and 1970s. Moreover, we find that the bias related to occupations experienced more change than that of mental disorders, with the ECT score starting at 0.83 in the 1960s and increasing by more than ten points to 0.94 in the 1990s. In contrast, mental disorder-related bias scores only ranged from 0.90 to 0.94.

Biased Words.
In Figure 2, we analyze the bias of individual occupational and mental health-related words. We found a substantial change in the bias of occupational-related words.
We found little change in the bias of mental health-related words since the 1960s. Yet, while we found little change in mental health bias overall, are there at least a few disorders that changed over time? Moreover, we found a slight bias in mental health terms; therefore, What are the biased terms in each group? We look at the most gender biased occupational and mental health-related terms for each decade in Table 4. Because of space limitations, we only display the gendered words from the 1970s to the 2010s. The words from the 1960s can be found in the appendix. The word-level scores were generated using RIPA. First, for occupations, the words vary between male and female. For example, in the 1970s, male-related words include "mechanic", "principal", and "investigator". The female-related words include "teacher", "counselor", and "pediatrician".

Table 4: Most gender biased terms for each decade (Occupations).
| Rank | Male 1970-1979 | Male 1980-1989 | Male 1990-1999 | Male 2000-2010 | Male 2010-2020 | Female 1970-1979 | Female 1980-1989 | Female 1990-1999 | Female 2000-2010 | Female 2010-2020 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | promoter | conductor | chef | dentist | mediator | teacher | housewife | neurosurgeon | swimmer | priest |
| 2 | collector | chef | baker | counselor | promoter | professor | teenager | pediatrician | baker | fisherman |
| 3 | investigator | biologist | astronaut | librarian | dentist | counselor | bishop | educator | butcher | teenager |
| 4 | principal | collector | swimmer | pharmacist | principal | physician | lawyer | teenager | medic | chef |
| 5 | baker | dad | prisoner | teenager | collector | pediatrician | pediatrician | counselor | barber | writer |
| 6 | researcher | singer | mechanic | bishop | cop | consultant | athlete | neurologist | physicist | nanny |
| 7 | character | chemist | character | acquaintance | conductor | doctor | physician | consultant | soldier | historian |
| 8 | mechanic | butler | worker | cardiologist | substitute | student | pathologist | dentist | baron | president |
| 9 | analyst | mechanic | soldier | promoter | coach | lawyer | educator | athlete | director | inventor |
| 10 | conductor | promoter | analyst | attorney | employee | pathologist | | | | |
Interestingly, the jobs associated with men, such as "principal" and "researcher", are positions with power over the jobs associated with women. For example, "principals" (male) have power over "teachers" (female) and "researchers" (male) have power over "students" (female). We also find that other well-known occupations appear to be gender-related. For instance, "butler" in the 1980s is associated with men, while "nanny" is associated with women in the 2010s.
With regard to mental health, we find that disorders associated with well-known gender disparities appear to be biased using RIPA (Organization, 2013). For example, throughout the last 60 years, words associated with addictions are male-related, e.g., "caffeine", "cannabis", "nicotine", and "gambling". Similarly, disorders related to appearance are more female-related, e.g., "dysmorphic" 2 and "anorexia". We also find that disorders related to emotions are more female-related, such as "munchausen" 3 , "hysteria" 4 , and "terror". Interestingly, we find that the word "hysteria" is heavily biased in the 2010s. Even though the diagnosis of female hysteria substantially fell in the 1900s (Micale, 1993), it still seems to be a biased term. We want to note that this could simply be caused by research studying mental health diagnosis bias in women; however, the underlying cause of why the term is biased in the 2010s is left for future work.
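The word-level ranking behind Table 4 can be illustrated with a hedged sketch of a RIPA-style score. The formula is not spelled out in the text above, so this assumes the definition from Ethayarajh et al. (2019), in which a word's score is its inner product with a unit-length relation vector. As a stated simplification, the relation vector here is the normalized mean of the male-minus-female difference vectors rather than their first principal component. All names and toy vectors are hypothetical.

```python
import math

def gender_direction(pairs):
    """Unit vector along the mean of (male - female) difference vectors.

    Simplification (assumption): the mean difference stands in for the
    first principal component of the difference vectors.
    """
    dim = len(pairs[0][0])
    diff = [sum(m[i] - f[i] for m, f in pairs) / len(pairs) for i in range(dim)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def ripa(w, direction):
    """Inner product of a word vector with the unit relation vector."""
    return sum(a * b for a, b in zip(w, direction))

def most_biased(embeddings, words, direction, k=10):
    """Return (top-k male-leaning words, top-k female-leaning words)."""
    scored = sorted(
        ((ripa(embeddings[w], direction), w) for w in words if w in embeddings),
        reverse=True,
    )
    male = [w for _, w in scored[:k]]
    female = [w for _, w in scored[::-1][:k]]
    return male, female

# Toy 2-d vectors, made up for illustration (not real embeddings).
toy_pairs = [([1.0, 0.0], [0.0, 1.0])]          # one (male, female) word pair
toy_embeddings = {
    "word_a": [2.0, 0.0],
    "word_b": [0.0, 2.0],
    "word_c": [1.0, 1.0],
}
direction = gender_direction(toy_pairs)
male_top, female_top = most_biased(
    toy_embeddings, ["word_a", "word_b", "word_c", "missing"], direction, k=1
)
```

With the direction pointing from female to male, positive scores lean male and negative scores lean female; because the score is an unnormalized inner product, it also reflects the length of the word vector.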

Discussion
In this section, we discuss the impact of the results on two stakeholders of this research: BioNLP researchers and general biomedical researchers. Furthermore, we discuss the limitations of focusing on binary gender (Male vs Female).
2 Dysmorphia is a mental health disorder in which you can't stop thinking about one or more perceived defects or flaws.
3 Munchausen is a mental disorder in which a person repeatedly and deliberately acts as if he or she has a physical or mental illness.
4 Hysteria is a (biased) catch-all for symptoms including, but not limited to, nervousness, hallucinations, and emotional outbursts.
The results in this paper are important for BioNLP research in two ways. First, we have produced decade-specific word embeddings. 5 Therefore, BioNLP researchers can use the embeddings to study other historical phenomena in biomedical research articles. Second, the analysis of historical bias in biomedical research in this paper can be applied to other domains, beyond occupations and mental disorders.

Impact on Biomedical Researchers.
With regard to general biomedical researchers (e.g., medical researchers and biologists), this work can provide an automated way to measure which demographics current research is leaning towards. As discussed in Holdcroft (2007), if research is heavily focused on a single gender, then health disparities can increase. Treatments should be explored equally for all at-risk patients. Furthermore, with the use of contextual word embeddings (Scheuerman et al., 2019), implicit bias measurement techniques can be used as part of the writing process to avoid gendered language when it is not necessary (e.g., using singular they vs he/she).

A Note About Gender.
Similar to prior work measuring gender bias (Chaloner and Maldonado, 2019), we focus on binary gender. However, it is important to note that the results for binary gender do not necessarily generalize to other genders, including, but not limited to, binary trans people, non-binary people, and gender non-conforming people (Scheuerman et al., 2019). In future work, we recommend that studies be performed for other genders, beyond simply studying Male vs Female.
How can this study be expanded beyond binary gender? The three bias measurement techniques studied in this paper (i.e., WEAT, ECT, and RIPA) require sets of words representing a single gender (e.g., boy, men, male). Unfortunately, there is not a large number of words to represent every gender of interest. A promising area of research is to explore bias in contextual word embeddings. With the use of contextual word embeddings (Kurita et al., 2019), we can measure the bias of individual words across many contexts. Thus, we can possibly overcome the problem of a limited number of words per gender.

Conclusion
In this paper, we studied the historical bias present in word embeddings from 1960 to 2020. In summary, we found that while some biases have shown a consistent decrease over time (e.g., Strong vs Weak), others have stayed relatively static, or worse, increased (e.g., Intelligence vs Appearance). Moreover, we found that the gender bias towards occupations has substantially changed over time, showing that in the past, there was more gender bias associated with certain jobs.
There are two major avenues for future work. First, this work quantified various aspects of gender bias over time. However, we do not know why the bias is present in the word embeddings. For example, is the word "hysteria" biased in 2010 because researchers are associating it with women implicitly, or is it that researchers are studying the historical usage of the diagnosis to ensure the diagnosis is not made because of implicit bias in the future? Thus, our future work will focus on causal studies of bias in biomedical literature. Second, we simply trained Skip-Gram word embeddings independently for each decade. However, recent work has shown that dynamic embeddings, rather than static (decade-specific) ones, perform better with regard to analyzing public perception over time (Gillani and Levy, 2019). Future work will focus on developing new techniques to study bias temporally. Moreover, many techniques may depend on the magnitude of the bias; therefore, we plan to analyze the circumstances in which one embedding approach (e.g., Skip-Gram) may measure bias better than another (e.g., dynamic embeddings).