Good Secretaries, Bad Truck Drivers? Occupational Gender Stereotypes in Sentiment Analysis

In this work, we investigate the presence of occupational gender stereotypes in sentiment analysis models. Such a task has implications in reducing implicit biases in these models, which are being applied to an increasingly wide variety of downstream tasks. We release a new gender-balanced dataset of 800 sentences pertaining to specific professions and propose a methodology for using it as a test bench to evaluate sentiment analysis models. We evaluate the presence of occupational gender stereotypes in 3 different models using our approach, and explore their relationship with societal perceptions of occupations.


Motivation
Social Role Theory (Eagly and Steffen, 1984) shows that our ideas about gender are shaped by observing, over time, the roles that men and women occupy in their daily lives. These ideas can crystallize into rigid stereotypes about how men and women ought to behave, and what work they can and cannot do. Gendered stereotypes are powerful precisely for this reason: they define desirable and expected traits, roles and behaviors in people, and go beyond description to prescription. Such biases from the social world, when they map onto machine learning models, serve to reinforce and propagate stereotypes further.
In this paper, we look specifically at occupational gender stereotypes in the context of sentiment analysis. Sentiment analysis is increasingly being applied for recruitment, employee retention and job satisfaction in the corporate world (Costa and Veloso, 2015). Given the prevalence of occupational gender stereotypes, our study primarily deals with the question of whether sentiment analysis models display and propagate these stereotypes. To contextualize and ground our study, we first provide a summary of the relevant sociological literature on occupational gender stereotypes.
Link to dataset: https://bit.ly/2HLSKnf

Background
Sociological studies as early as 1975 (Shinar, 1975) investigate gender stereotypes of occupations, and rank occupations in terms of how "masculine", "feminine" or neutral they are perceived to be. Cejka and Eagly (1999) successfully predicted the gender distribution of occupations based on beliefs about how specific gender-stereotypical attributes (such as "masculine physical") contribute to occupational success. Such beliefs (for example, that success in a male-dominated profession requires male-specific traits) directly contribute to sex segregation in occupations. The study also found that high occupational prestige and wages are strongly correlated with masculine images. Together, these findings show that occupational structure is deeply shaped by gender. More recently, Haines et al. (2016) investigate how and whether gender stereotypes have changed between 1983 and 2014, and find conclusive evidence that occupational gender stereotypes have persisted strongly and remain stable. There is ample sociological evidence to show that occupational gender stereotypes have not undergone substantial modification since the entry of women into the workplace, and that they remain pervasive and widely held by both men and women (Glick et al., 1995; Haines et al., 2016).
Since occupational gender stereotypes are shaped by subjective factors and not objective reality, they remain resistant to contrary evidence. Theories such as the backlash hypothesis (Rudman and Phelan, 2008) further explain their persistence: this theory shows how women in the workplace must disconfirm female stereotypes in order to be perceived as competent leaders, yet traits of ambition and capability in women evoke negative reactions which present a barrier to every level of occupational success.
The implications of occupational gender stereotypes are profound. Children and adolescents are particularly sensitive to gendered language used to describe occupations and form rigid occupational gender stereotypes based on this (Vervecken et al., 2013). In adults, occupational gender stereotypes directly contribute to the existence of unequal compensation and discriminatory hiring. They also lead to self-fulfilling prophecies: for instance, individuals may not apply to certain jobs in the first place because they think they don't fit the gender stereotype for occupational success in that field (Kay et al., 2015).
In the following section, we discuss relevant prior work on gender bias from the NLP literature. In Section 3 we describe our methodology, dataset, and experiments in greater detail. In Section 4, we present and analyze our results, and finally, Section 5 describes possible directions of future work and concludes.

Prior Work
Word embeddings have been the bedrock of neural NLP models ever since the arrival of word2vec (Mikolov et al., 2013), and a variety of topics related to biases with word embeddings have been studied in prior literature. Garg et al. (2018) show the presence of stereotypes in word embeddings through the ages, while Bolukbasi et al. (2016) demonstrate explicit examples of social biases that are introduced into word embeddings trained on a large text corpus. Prior work has also dealt with occupational gender stereotypes in different areas of NLP. Caliskan et al. (2017) formulate a method to test biases (including gender stereotypes) in word embeddings, while Rudinger et al. (2018) investigate such stereotypes in the context of coreference resolution. There have also been efforts to debias word embeddings (Bolukbasi et al., 2016) and come up with gender neutral word embeddings (Zhao et al., 2018). These efforts, however, have attracted criticism suggesting that they do not actually debias embeddings but instead redistribute the bias across the embedding landscape (Gonen and Goldberg, 2019).
Recent trends have been towards replacing fixed word embeddings with large pretrained contextual representations as building blocks for NLP tasks. The rise of this paradigm is characterized by the use of language models for pretraining, exemplified by models such as ELMo (Peters et al., 2018), ULMFiT (Howard and Ruder, 2018), GPT (Radford, 2018), and BERT (Devlin et al., 2018).
Source code for this paper: github.com/jayadevbhaskaran/gendered-sentiment
These models have shown marked improvements over word vector based approaches for a variety of tasks. However, their complexity leads to a tradeoff in terms of interpretability. Recent works have investigated gender biases in such deep contextual representations (May et al., 2019; Basta et al., 2019) as well as their applications to coreference resolution (Zhao et al., 2019; Webster et al., 2018); however, no prior work has dealt with such models in the context of occupational gender stereotypes in sentiment analysis.
Kiritchenko and Mohammad (2018) introduce the Equity Evaluation Corpus, a dataset used for measuring racial and gender biases in sentiment analysis-like systems. It was initially used to evaluate systems that predicted emotion and valence of Tweets (Mohammad et al., 2018). We use a similar approach to create a new dataset for measuring gender differences with a specific focus on occupational gender stereotypes. Our approach is model-independent and can be used for any sentiment analysis system, irrespective of model complexity.

Methodology
We create a dataset of 800 sentences, each with the following structure: noun is a/an profession. Here, noun corresponds to a male or female noun phrase, such as "This boy"/"This girl", and profession is one of 20 different professions. Each sentence is an assertion of fact, and by itself does not seek to exhibit either positive or negative sentiment. Our dataset is balanced across genders and has 20 noun phrases for each gender, leading to a total of 400 sentences per gender.
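The template construction described above can be sketched as follows. This is a minimal illustration, not the released dataset: the noun pairs and professions are a small subset taken from the text, and the names `NOUN_PAIRS`, `PROFESSIONS`, `article` and `build_dataset` are hypothetical. With the full 20 noun pairs and 20 professions, the same loop would yield 400 sentences per gender.

```python
from itertools import product

# Illustrative subset only: the full dataset uses 20 noun pairs and 20 professions.
NOUN_PAIRS = [("This boy", "This girl"), ("This man", "This woman")]
PROFESSIONS = ["truck driver", "secretary", "nurse", "pilot"]


def article(profession: str) -> str:
    """Return 'an' before a vowel letter, otherwise 'a' (a simple spelling heuristic)."""
    return "an" if profession[0].lower() in "aeiou" else "a"


def build_dataset(noun_pairs, professions):
    """Return one record per (noun pair, profession, gender) combination."""
    rows = []
    for (male, female), profession in product(noun_pairs, professions):
        for gender, noun in (("male", male), ("female", female)):
            rows.append({
                "gender": gender,
                "noun": noun,
                "profession": profession,
                "sentence": f"{noun} is {article(profession)} {profession}.",
            })
    return rows


dataset = build_dataset(NOUN_PAIRS, PROFESSIONS)
print(len(dataset), dataset[0]["sentence"])
```

Keeping the male and female versions of each template adjacent makes it straightforward to pair them later for a paired test.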
The rationale behind our selection of the 20 professions is to include a variety of gender distribution characteristics and occupation types, in correspondence with US Current Population Survey 2018 (CPS) data (Current Population Survey, 2018) and prior literature (Haines et al., 2016). We select 5 professions that are male-dominated (truck driver, mechanic, pilot, chef, soldier) and 5 that are female-dominated (teacher, flight attendant, clerk, secretary, nurse), with domination meaning greater than 70% share in the job distribution. Next, we add professions that are slightly male-dominated (scientist, lawyer, doctor) and slightly female-dominated (writer, dancer), with slight domination meaning a 60-65% share in the job distribution. We also add professor, which does not have a clear definition as per CPS but has been known to have different gender splits at senior and junior levels. Finally, we include two professions that show an approximately neutral divide (tailor, gym trainer) and two which have experienced significant changes in their gender distribution over time (baker, bartender), with an increasing female representation in recent times (Haines et al., 2016). As mentioned previously, we also select our set of occupations with an eye towards representing a range of occupation types.
We evaluate 3 sentiment analysis models through our experiments. Each model is trained on the Stanford Sentiment Treebank 2 train dataset (Socher et al., 2013), which contains phrases from movie reviews along with binary (0/1) sentiment labels. We then evaluate each model on our new corpus and measure the difference in mean predicted positive class probabilities between sentences with male nouns and those with female nouns. We test 3 hypotheses (one for each model), with the null hypotheses indicating no difference in means between sentences with male and female nouns. Fig. 1 illustrates our experimental setup.
Our evaluation methodology is very similar to that used in Kiritchenko and Mohammad (2018). For each system, we predict the positive class probability for each sentence. We then apply a paired t-test (since each pair contains a male and female version of the same template sentence) to measure if the mean predicted positive class probabilities are different across genders, using a significance level of 0.01. Since we test three hypotheses (one for each system), we apply Bonferroni correction (Bonferroni, 1936) to the p-values that we obtain. In other words, the null hypothesis is rejected only for calculated p-values less than 0.01/3. We note that we do not perform any correction to account for the fact that the sentences within each gender are not i.i.d., and only vary in the noun and profession words.
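The test procedure above can be sketched with the standard library alone. The probabilities below are invented toy values, and the function names are hypothetical. The p-value uses a normal approximation, which is an assumption made here to avoid external dependencies; it is only reasonable for a large number of pairs (such as the 400 in the dataset), and an exact t distribution (for instance via `scipy.stats.ttest_rel`) would be the more appropriate choice in a real analysis.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

ALPHA = 0.01        # significance level stated in the text
NUM_HYPOTHESES = 3  # one hypothesis per model


def paired_t_statistic(female_probs, male_probs):
    """t statistic for the mean of the pairwise differences (female - male)."""
    diffs = [f - m for f, m in zip(female_probs, male_probs)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def approx_two_sided_p(t_stat):
    """Two-sided p-value under a normal approximation (assumes many pairs)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


def reject_null(p_value, alpha=ALPHA, num_hypotheses=NUM_HYPOTHESES):
    """Bonferroni: compare each p-value against alpha / number of hypotheses."""
    return p_value < alpha / num_hypotheses


# Toy predicted positive class probabilities, invented for illustration.
female = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69]
male = [0.55, 0.54, 0.65, 0.61, 0.57, 0.62]

t_stat = paired_t_statistic(female, male)
p_value = approx_two_sided_p(t_stat)
print(t_stat, p_value, reject_null(p_value))
```

Dividing the threshold by the number of hypotheses is equivalent to multiplying each p-value by 3 and comparing against 0.01.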
The 3 models that we evaluate are as follows:
• M.1: Bag-of-words + Logistic Regression (baseline): We build a simple bag-of-words model, apply tf-idf weighting, and use logistic regression (implemented using scikit-learn (Pedregosa et al., 2011)) to classify sentiment. This model is a very simple approach that has nevertheless been found to work well in practice for sentiment analysis tasks, and we use it as our baseline model.
• M.2: BiLSTM: We use a bidirectional LSTM implemented in Keras (Chollet et al., 2015) to predict sentiment. The words in a sentence are represented by 300-dimensional GloVe embeddings (Pennington et al., 2014). This model is more sophisticated than the baseline and captures some contextual information and long-term dependencies (Hochreiter and Schmidhuber, 1997). This model also allows us to investigate gender differences that might be introduced through word embeddings, as described in Bolukbasi et al. (2016).
• M.3: BERT: We fine-tune a pretrained BERT model (Devlin et al., 2018) on the SST-2 train dataset.
While analysing the results of our experiments, we measure overall predicted mean positive probabilities (across genders) for each of the 20 professions in our newly created dataset, to identify which professions are rated as high-sentiment by these models. This helps us investigate relationships between societal perceptions of occupations and corresponding sentiment predictions from the models.
We also examine differences in sentiment among equivalent gender pairs (such as bachelor and spinster) for the 20 pairs in our dataset, to investigate differences in predicted sentiment between different sets of male/female noun pairs. Finally, we examine differences between male and female nouns for each individual occupation, to understand which occupations are susceptible to gender stereotyping.
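The per-profession analyses described above amount to grouping predictions and comparing means. The sketch below assumes predictions are available as (profession, gender, probability) records; the values are invented placeholders, and `summarize_by_profession` is a hypothetical helper. The difference is reported as female minus male, so a positive value means sentences with female nouns score higher.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (profession, gender, predicted positive class probability).
predictions = [
    ("pilot", "male", 0.70), ("pilot", "female", 0.80),
    ("clerk", "male", 0.50), ("clerk", "female", 0.40),
]


def summarize_by_profession(records):
    """Return the overall mean and the female-minus-male gap for each profession."""
    by_key = defaultdict(list)
    for profession, gender, prob in records:
        by_key[(profession, gender)].append(prob)

    summary = {}
    for profession in {p for p, _ in by_key}:
        male = by_key[(profession, "male")]
        female = by_key[(profession, "female")]
        summary[profession] = {
            "overall_mean": mean(male + female),
            "female_minus_male": mean(female) - mean(male),
        }
    return summary


summary = summarize_by_profession(predictions)
for profession, stats in sorted(summary.items()):
    print(profession, stats)
```

The same grouping applied to noun pairs instead of professions would give the per-pair comparison.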

Results/Analysis
The main results of our experiments are shown in Table 1. Our null hypothesis is that the predicted positive probabilities for female and male sentences have identical means. We notice that M.1 (Bag-of-words + Logistic Regression) and M.2 (BiLSTM) show a statistically significant difference between the two genders, with higher predicted positive class probabilities for sentences with female nouns. This effectively represents the biases seen in the SST-2 train dataset. The dataset has 1182 sentences containing a male noun with a mean sentiment of 0.535, and 601 sentences containing a female noun with a mean sentiment of 0.599. Thus, biases present in training data can get propagated through machine learning models, and our approach can help identify these.
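A training-set statistic of this kind could be computed roughly as follows. The noun lists and example sentences are hypothetical, since the exact word lists used to obtain the counts above are not given here; this is only a sketch of the idea of averaging binary labels over sentences that mention a gendered noun.

```python
import re
from statistics import mean

# Hypothetical noun lists; the lists behind the reported counts are not specified.
MALE_NOUNS = {"man", "boy", "father", "brother"}
FEMALE_NOUNS = {"woman", "girl", "mother", "sister"}


def mean_label_for(sentences, nouns):
    """Return (count, mean label) over sentences containing any word in `nouns`."""
    labels = [
        label
        for text, label in sentences
        if nouns & set(re.findall(r"[a-z']+", text.lower()))
    ]
    return len(labels), (mean(labels) if labels else None)


# Toy (sentence, binary sentiment label) pairs, invented for illustration.
toy_train = [
    ("the boy gives a wonderful performance", 1),
    ("a dull film about a man and a dog", 0),
    ("the woman carries the movie with real charm", 1),
    ("the girl is wasted in a lifeless script", 0),
    ("the mother is the heart of this warm story", 1),
]

male_count, male_mean = mean_label_for(toy_train, MALE_NOUNS)
female_count, female_mean = mean_label_for(toy_train, FEMALE_NOUNS)
print(male_count, male_mean, female_count, female_mean)
```

Matching on whole tokens rather than substrings avoids counting words such as "woman" as a match for "man".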
In contrast, M.3 (BERT) assigns a higher predicted positive class probability to sentences with male nouns than to sentences with female nouns, and this difference is statistically significant. One possible reason for this might be biases that propagate from the pretraining phase in BERT. This finding indicates a promising direction of future work: investigating the effects of gender biases in the large pretraining corpus versus those in the smaller fine-tuning corpus (in our case, the SST-2 train dataset).

Social Stereotypes of Occupations
We now look at mean distributions of positive class probability (across genders) for each profession, as shown in Table 2. We notice that secretary shows up as a high positive sentiment profession. On further investigation, we notice that this artefact arises because of the 2002 movie Secretary, starring Maggie Gyllenhaal, which has a number of positive reviews that form a part of the SST-2 train dataset. However, M.3 (BERT) seems to be impervious to this, indicating that extensive pretraining could have the potential to remove certain corpus-specific effects that might have lingered in shallower models. The profession with the lowest average sentiment score across all 3 models is truck driver; other low scoring professions include clerk, gym trainer and flight attendant. We also note that the highest scoring profession (average sentiment 0.99) with M.3 (BERT) is scientist and the lowest (average sentiment 0.34) is truck driver, disturbingly reflective of societal stereotypes about white-collar and blue-collar jobs.
To explore this further, we look at data from the Current Population Survey of the US Bureau of Labor Statistics (Current Population Survey, 2018). Fig. 2 shows the relationship between median weekly earnings (for occupations where data is available) and average positive sentiment predicted by BERT. While there are some outliers, the figure shows a positive correlation between earnings and sentiment, indicating that the model may have incorporated societal perceptions around different occupations. We note that this is only a rough analysis, as not all occupations directly correspond to entries from the survey data.

Gendered Stereotypes
We attempt to analyze differences in gender within occupations by studying the predictions of M.3 (BERT), which incorporates the largest amount of external data. First, we analyze differences in mean positive class probability between sentences with male and female nouns for each profession. We notice that pilot has the highest positive difference between female and male noun sentences (i.e., female is higher), while flight attendant has the most negative difference (i.e., male is higher). This provides an interesting dichotomy: pilot is a male-dominated profession, while flight attendant is a female-dominated one.
To test whether these are just artefacts of generic gender bias in the model or specific to occupational gendered stereotypes, we replace profession with "person" to create 20 sentence pairs such as "This man/this woman is a person.", and predict the sentiment for these 20 pairs. We notice that the difference between female and male noun sentences for the control experiment is 0.039, showing that sentences with female nouns in the control group exhibit higher positive sentiment than those with male nouns. The three occupations with the most negative difference (i.e., female sentences have lower positive sentiment) are flight attendant (−0.132), bartender (−0.126), and clerk (−0.116). Of these, flight attendant (72%) and clerk (86%) are female-dominated professions (Current Population Survey, 2018), while bartender (55%) is a profession that has been shifting from male to female-dominated in recent times (Haines et al., 2016).
Finally, we study differences between corresponding pairs of female and male nouns, using predictions from M.3 (BERT). Out of the 20 pairs in our dataset, the pair with the greatest difference in mean positive class probability is spinster and bachelor, with spinster − bachelor = −0.404 (p < 0.01). This reflects societal perceptions of spinster as someone who is characterized as alone, lonely and resembling an "old maid", versus bachelor as someone who might be young, carefree and fun-loving (Nieuwets, 2015). This is an example of semantic pejoration seen in society, where the female form of the noun (i.e., spinster) gradually acquires a negative connotation. Notably, this pejorative behavior may have also leaked into the model, reflecting societal gender stereotypes.

Conclusion/Future Work
In this paper, we introduce a new dataset that can be used to test the presence of occupational gender stereotypes in any sentiment analysis model. We then train 3 sentiment analysis models and evaluate them using our dataset. Following that, we analyze our results, exploring social stereotypes of occupations as well as gendered stereotypes. We find that all 3 models that we study exhibit differences in mean predicted positive class probability between genders, though the directions vary. We also see that simpler models may be more susceptible to biases seen in the training dataset, while deep contextual models may exhibit biases potentially introduced during pretraining.
One promising avenue for future work is to explore occupational stereotypes in deep contextual models by analyzing their training corpora. This could also help identify techniques to mitigate biases in such models, since they could be relatively impervious to biases introduced by fine-tuning (especially on smaller datasets).
From a sociological perspective, we plan to investigate occupational gender stereotypes in downstream applications such as automated resume screening. Such a task assumes greater importance with the increased use of these systems in today's world. There is prior work on ethnic bias in such tools (Derous and Ryan, 2018), and we believe that there is significant value in exploring and characterizing gender biases in these systems.