Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems

Automatic machine learning systems can inadvertently accentuate and perpetuate inappropriate human biases. Past work on examining inappropriate biases has largely focused on just individual systems. Further, there is no benchmark dataset for examining inappropriate biases in systems. Here for the first time, we present the Equity Evaluation Corpus (EEC), which consists of 8,640 English sentences carefully chosen to tease out biases towards certain races and genders. We use the dataset to examine 219 automatic sentiment analysis systems that took part in a recent shared task, SemEval-2018 Task 1 ‘Affect in Tweets’. We find that several of the systems show statistically significant bias; that is, they consistently provide slightly higher sentiment intensity predictions for one race or one gender. We make the EEC freely available.


Introduction
Automatic systems have had a significant and beneficial impact on all walks of human life. So much so that it is easy to overlook their potential to benefit society by promoting equity, diversity, and fairness. For example, machines do not take bribes to do their jobs, they can determine eligibility for a loan without being influenced by the color of the applicant's skin, and they can provide access to information and services without discrimination based on gender or sexual orientation. Nonetheless, as machine learning systems become more human-like in their predictions, they can also perpetuate human biases. Some learned biases may be beneficial for the downstream application (e.g., learning that humans often use some insect names, such as spider or cockroach, to refer to unpleasant situations). Other biases can be inappropriate and result in negative experiences for some groups of people. Examples include, loan eligibility and crime recidivism prediction systems that negatively assess people belonging to a certain pin/zip code (which may disproportionately impact people of a certain race) (Chouldechova, 2017) and resumé sorting systems that believe that men are more qualified to be programmers than women (Bolukbasi et al., 2016). Similarly, sentiment and emotion analysis systems can also perpetuate and accentuate inappropriate human biases, e.g., systems that consider utterances from one race or gender to be less positive simply because of their race or gender, or customer support systems that prioritize a call from an angry male over a call from the equally angry female.
Predictions of machine learning systems have also been shown to be of higher quality when dealing with information from some groups of people as opposed to other groups of people. For example, in the area of computer vision, gender classification systems perform particularly poorly for darker skinned females (Buolamwini and Gebru, 2018). Natural language processing (NLP) systems have been shown to be poor in understanding text produced by people belonging to certain races (Blodgett et al., 2016;Jurgens et al., 2017). For NLP systems, the sources of the bias often include the training data, other corpora, lexicons, and word embeddings that the machine learning algorithm may leverage to build its prediction model.
Even though there is some recent work highlighting such inappropriate biases (such as the work mentioned above), each such past work has largely focused on just one or two systems and resources. Further, there is no benchmark dataset for examining inappropriate biases in natural language systems. In this paper, we describe how we compiled a dataset of 8,640 English sentences carefully chosen to tease out biases towards certain races and genders. We will refer to it as the Equity Evaluation Corpus (EEC). We used the EEC as a supplementary test set in a recent shared task on predicting sentiment and emotion intensity in tweets, SemEval-2018 Task 1: Affect in Tweets . 1 In particular, we wanted to test a hypothesis that a system should equally rate the intensity of the emotion expressed by two sentences that differ only in the gender/race of a person mentioned. Note that here the term system refers to the combination of a machine learning architecture trained on a labeled dataset, and possibly using additional language resources. The bias can originate from any or several of these parts. We were thus able to use the EEC to examine 219 sentiment analysis systems that took part in the shared task.
We compare emotion and sentiment intensity scores that the systems predict on pairs of sentences in the EEC that differ only in one word corresponding to race or gender (e.g., 'This man made me feel angry' vs. 'This woman made me feel angry'). We find that the majority of the systems studied show statistically significant bias; that is, they consistently provide slightly higher sentiment intensity predictions for sentences associated with one race or one gender. We also find that the bias may be different depending on the particular affect dimension that the natural language system is trained to predict.
Despite the work we describe here and what others have proposed in the past, it should be noted that there are no simple solutions for dealing with inappropriate human biases that percolate into machine learning systems. It seems difficult to ever be able to identify and quantify all of the inappropriate biases perfectly (even when restricted to the scope of just gender and race). Further, any such mechanism is liable to be circumvented, if one chooses to do so. Nonetheless, as developers of sentiment analysis systems, and NLP systems more broadly, we cannot absolve ourselves of the ethical implications of the systems we build. Even if it is unclear how we should deal with the inappropriate biases in our systems, we should be measuring such biases. The Equity Evaluation Corpus is not meant to be a catch-all for all inappropriate biases, but rather just one of the several ways by which we can examine the fairness of sentiment analysis systems. We make the corpus freely available so that both developers and users can use it, and build on it. 2

Related Work
Recent studies have demonstrated that the systems trained on the human-written texts learn humanlike biases (Bolukbasi et al., 2016;Caliskan et al., 2017). In general, any predictive model built on historical data may inadvertently inherit human biases based on gender, ethnicity, race, or religion (Sweeney, 2013;Datta et al., 2015). Discrimination-aware data mining focuses on measuring discrimination in data as well as on evaluating performance of discrimination-aware predictive models (Zliobaite, 2015;Pedreshi et al., 2008;Hajian and Domingo-Ferrer, 2013;Goh et al., 2016).
In NLP, the attention so far has been primarily on word embeddings-a popular and powerful framework to represent words as low-dimensional dense vectors. The word embeddings are usually obtained from large amounts of human-written texts, such as Wikipedia, Google News articles, or millions of tweets. Bias in sentiment analysis systems has only been explored in simple systems that make use of pre-computed word embeddings (Speer, 2017). There is no prior work that systematically quantifies the extent of bias in a large number of sentiment analysis systems.
This paper does not examine the differences in accuracies of systems on text produced by different races or genders, as was done by Hovy (2015) (Schmidt, 2015;Bolukbasi et al., 2016;Kilbertus et al., 2017;Ryu et al., 2017;Speer, 2017;Zhang et al., 2018;Zhao et al., 2018) are also beyond the scope of this paper. See also the position paper by Hovy and Spruit (2016), which identifies socio-ethical implications of the NLP systems in general.

The Equity Evaluation Corpus
We now describe how we compiled a dataset of thousands of sentences to determine whether automatic systems consistently give higher (or lower) sentiment intensity scores to sentences involving a particular race or gender. There are several ways in which such a dataset may be compiled. We present below the choices that we made.  We decided to use sentences involving at least one race-or gender-associated word. The sentences were intended to be short and grammatically simple. We also wanted some sentences to include expressions of sentiment and emotion, since the goal is to test sentiment and emotion systems. We, the authors of this paper, developed eleven sentence templates after several rounds of discussion and consensus building. They are shown in Table 1. The templates are divided into two groups. The first type (templates 1-7) includes emotion words. The purpose of this set is to have sentences expressing emotions. The second type (templates 8-11) does not include any emotion words. The purpose of this set is to have nonemotional (neutral) sentences.
The templates include two variables: <person> and <emotion word>. We generate sentences from the template by instantiating each variable with one of the pre-chosen values that the variable can take. Each of the eleven templates includes the variable <person>. <person> can be instantiated by any of the following noun phrases: • Common African American female or male first names; Common European American female or male first names; • Noun phrases referring to females, such as 'my daughter'; and noun phrases referring to males, such as 'my son'.
For our study, we chose ten names of each kind from the study by Caliskan et al. (2017) (see Table 2). The full lists of noun phrases representing females and males, used in our study, are shown in Table 3.  The second variable, <emotion word>, has two variants. Templates one through four include a variable for an emotional state word. The emotional state words correspond to four basic emotions: anger, fear, joy, and sadness. Specifically, for each of the emotions, we selected five words that convey that emotion in varying intensities. These words were taken from the categories in the Roget's Thesaurus corresponding to the four emotions: category #900 Resentment (for anger), category #860 Fear (for fear), category #836 Cheerfulness (for joy), and category #837 Dejection (for sadness). 4 Templates five through seven include emotion words describing a situation or event. These words were also taken from the same thesaurus categories listed above. The full lists of emotion words (emotional state words and emotional situation/event words) are shown in Table 4.
We generated sentences from the templates by replacing <person> and <emotion word> variables with the values they can take. In total, 8,640 sentences were generated with the various combinations of <person> and <emotion word> values across the eleven templates. We manually exam-  ined the sentences to make sure they were grammatically well-formed. 5 Notably, one can derive pairs of sentences from the EEC such that they differ only in one word corresponding to gender or race (e.g., 'My daughter feels devastated' and 'My son feels devastated'). We refer to the full set of 8,640 sentences as Equity Evaluation Corpus.

Measuring Race and Gender Bias in Automatic Sentiment Analysis Systems
The race and gender bias evaluation was carried out on the output of the 219 automatic systems that participated in SemEval-2018 Task 1: Affect in Tweets (Mohammad et al., 2018). 6 The shared task included five subtasks on inferring the affectual state of a person from their tweet: 1. emotion intensity regression, 2. emotion intensity ordinal classification, 3. valence (sentiment) regression, 4. valence ordinal classification, and 5. emotion classification. For each subtask, labeled data were provided for English, Arabic, and Spanish. The race and gender bias were analyzed for the system outputs on two English subtasks: emotion intensity regression (for anger, fear, joy, and sadness) and valence regression. These regression tasks were formulated as follows: Given a tweet and an affective dimension A (anger, fear, joy, sadness, or valence), determine the intensity of A that best represents the mental state of the tweeter-a realvalued score between 0 (least A) and 1 (most A). Separate training and test datasets were provided for each affective dimension.
Training sets included tweets along with gold intensity scores. Two test sets were provided for each task: 1. a regular tweet test set (for which the gold intensity scores are known but not revealed to the participating systems), and 2. the Equity Evaluation Corpus (for which no gold intensity labels exist). Participants were told that apart from the usual test set, they are to run their systems on a separate test set of unknown origin. 7 The participants were instructed to train their system on the tweets training sets provided, and that they could use any other resources they may find or create. They were to run the same final system on the two test sets. The nature of the second test set was revealed to them only after the competition. The first (tweets) test set was used to evaluate and rank the quality (accuracy) of the systems' predictions. The second (EEC) test set was used to perform the bias analysis, which is the focus of this paper. Systems: Fifty teams submitted their system outputs to one or more of the five emotion intensity regression tasks (for anger, fear, joy, sadness, and valence), resulting in 219 submissions in total. Many systems were built using two types of features: deep neural network representations of tweets (sentence embeddings) and features derived from existing sentiment and emotion lexicons. These features were then combined to learn a model using either traditional machine learning algorithms (such as SVM/SVR and Logistic Regression) or deep neural networks. SVM/SVR, LSTMs, and Bi-LSTMs were some of the most widely used machine learning algorithms. The sentence embeddings were obtained by training a neural network on the provided training data, a distant supervision corpus (e.g., AIT2018 Distant Supervision Corpus that has tweets with emotionrelated query terms), sentiment-labeled tweet corpora (e.g., Semeval-2017 Task4A dataset on sentiment analysis in Twitter), or by using pre-trained models (e.g., DeepMoji (Felbo et al., 2017), Skip thoughts (Kiros et al., 2015)). The lexicon features were often derived from the NRC emotion and sentiment lexicons (Mohammad and Turney, 2013; Kiritchenko et al., 2014;, AFINN (Nielsen, 2011), and Bing Liu Lexicon (Hu and Liu, 2004).
We provided a baseline SVM system trained using word unigrams as features on the training data (SVM-Unigrams).
This system is also included in the current analysis. Measuring bias: To examine gender bias, we compared each system's predicted scores on the EEC sentence pairs as follows: • We compared the predicted intensity score for a sentence generated from a template using a female noun phrase (e.g., 'The conversation with my mom was heartbreaking') with the predicted score for a sentence generated from the same template using the corresponding male noun phrase (e.g., 'The conversation with my dad was heartbreaking'). • For the sentences involving female and male first names, we compared the average predicted score for a set of sentences generated from a template using each of the female first names (e.g., 'The conversation with Amanda was heartbreaking') with the average predicted score for a set of sentences generated from the same template using each of the male first names (e.g., 'The conversation with Alonzo was heartbreaking').
Thus, eleven pairs of scores (ten pairs of scores from ten noun phrase pairs and one pair of scores from the averages on name subsets) were examined for each template-emotion word instantiation. There were twenty different emotion words used in seven templates (templates 1-7), and no emotion words used in the four remaining templates (templates 8-11). In total, 11 × (20 × 7 + 4) = 1, 584 pairs of scores were compared. Similarly, to examine race bias, we compared pairs of system predicted scores as follows: • We compared the average predicted score for a set of sentences generated from a template using each of the African American first names, both female and male, (e.g., 'The conversation with Ebony was heartbreaking') with the average predicted score for a set of sentences generated from the same template using each of the European American first names (e.g., 'The conversation with Amanda was heartbreaking'). Thus, one pair of scores was examined for each template-emotion word instantiation. In total, 1 × (20×7+4) = 144 pairs of scores were compared.
For each system, we calculated the paired two sample t-test to determine whether the mean difference between the two sets of scores (across the two races and across the two genders) is significant. We set the significance level to 0.05. However, since we performed 438 assessments (219 submissions evaluated for biases in both gender and race), we applied Bonferroni correction. The null hypothesis that the true mean difference between the paired samples was zero was rejected if the calculated p-value fell below 0.05/438.

Results
The two sub-sections below present the results from the analysis for gender bias and race bias, respectively.

Gender Bias Results
Individual submission results were communicated to the participants. Here, we present the summary results across all the teams. The goal of this analysis is to gain a better understanding of biases across a large number of current sentiment analysis systems. Thus, we partition the submissions into three groups according to the bias they show: • F=M not significant: submissions that showed no statistically significant difference in intensity scores predicted for corresponding female and male noun phrase sentences, • F↑-M↓ significant: submissions that consistently gave higher scores for sentences with female noun phrases than for corresponding sentences with male noun phrases, • F↓-M↑ significant: submissions that consistently gave lower scores for sentences with female noun phrases than for corresponding sentences with male noun phrases. For each system and each sentence pair, we calculate the score difference ∆ as the score for the female noun phrase sentence minus the score for the corresponding male noun phrase sentence. Table 5 presents the summary results for each of the bias groups. It has the following columns: • #Subm.: number of submissions in each group.
If all the systems are unbiased, then the number of submissions for the group F=M not significant would be the maximum, and the number of submissions in all other groups would be zero. • Avg. score difference F↑-M↓: the average ∆ for only those pairs where the score for the female noun phrase sentence is higher. The greater the magnitude of this score, the stronger the bias in systems that consistently give higher scores to female-associated sentences.  • Avg. score difference F↓-M↑: the average ∆ for only those pairs where the score for the female noun phrase sentence is lower. The greater the magnitude of this score, the stronger the bias in systems that consistently give lower scores to female-associated sentences. Note that these numbers were first calculated separately for each submission, and then averaged over all the submissions within each submission group. The results are reported separately for submissions to each task (anger, fear, joy, sadness, and sentiment/valence intensity prediction).
Observe that on the four emotion intensity prediction tasks, only about 12 of the 46 submissions (about 25% of the submissions) showed no statistically significant score difference. On the valence prediction task, only 5 of the 36 submissions (14% of the submissions) showed no statistically significant score difference. Thus 75% to 86% of the submissions consistently marked sentences of one gender higher than another.
When predicting anger, joy, or valence, the number of systems consistently giving higher scores to sentences with female noun phrases (21-25) is markedly higher than the number of systems giving higher scores to sentences with male noun phrases (8-13). (Recall that higher valence means more positive sentiment.) In contrast, on the fear task, most submissions tended to assign higher scores to sentences with male noun phrases (23) as compared to the number of systems giving higher scores to sentences with female noun phrases (12). When predicting sadness, the number of submissions that mostly assigned higher scores to sentences with female noun phrases (18) is close to the number of submissions that mostly assigned higher scores to sentences with male noun phrases (16). These results are in line with some common stereotypes, such as females are more emotional, and situations involving male agents are more fearful (Shields, 2002). Figure 1 shows the score differences (∆) for individual systems on the valence regression task. Plots for the four emotion intensity prediction tasks are shown in Figure 3 in the Appendix. Each point (L, M, G) on the plot corresponds to the difference in scores predicted by the system on one sentence pair. The systems are ordered by their rank (from first to last) on the task on the tweets test sets, as per the official evaluation metric (Spearman correlation with the gold intensity scores). We will refer to the difference between the maximal value of ∆ and the minimal value of ∆ for a particular system as the ∆spread. Observe that the ∆-spreads for many systems are rather large, up to 0.57. Depending on the task, the top 10 or top 15 systems as well as some of the worst performing systems tend to have smaller ∆-spreads while the systems with medium to low performance show greater sensitivity to the gender-associated words. Also, most submissions that showed no statistically significant score differences (shown in green) performed poorly on the tweets test sets. Only three systems out of the top five on the anger intensity task and one system on the joy and sadness tasks showed no statistically significant score difference. This indicates that when considering only those systems that performed well on the intensity prediction task, the percentage of gender-biased systems are even higher than those indicated above.
These results raise further questions such as 'what exactly is the cause of such biases?' and 'why is the bias impacted by the emotion task under consideration?'. Answering these questions will require further information on the resources that the teams used to develop their models, and we leave that for future work. Figure 1: Analysis of gender bias: Box plot of the score differences on the gender sentence pairs for each system on the valence regression task. Each point on the plot corresponds to the difference in scores predicted by the system on one sentence pair. L represents F↑-M↓ significant group, M represents F↓-M↑ significant group, and G represents F=M not significant group. For each system, the bottom and top of a grey box are the first and third quartiles, and the band inside the box shows the second quartile (the median). The whiskers extend to 1.5 times the interquartile range (IQR = Q3 -Q1) from the edge of the box. The systems are ordered by rank (from first to last) on the task on the tweets test sets as per the official evaluation metric.
Average score differences: For submissions that showed statistically significant score differences, the average score difference F↑-M↓ and the average score difference F↓-M↑ were ≤ 0.03. Since the intensity scores range from 0 to 1, 0.03 is 3% of the full range. The maximal score difference (∆) across all the submissions was as high as 0.34. Note, however, that these ∆s are the result of changing just one word in a sentence. In more complex sentences, several gender-associated words can appear, which may have a bigger impact. Also, whether consistent score differences of this magnitude will have significant repercussions in downstream applications, depends on the particular application. Analyses on only the neutral sentences in EEC and only the emotional sentences in EEC: We also performed a separate analysis using only those sentences from the EEC that included no emotion words. Recall that there are four templates that contain no emotion words. 8 Tables 6 shows these results. We observe similar trends as in the analysis on the full set. One noticeable difference is that the number of submissions that showed statistically significant score difference is much smaller for this data subset. However, the total number of comparisons on the subset (44) is much smaller than the total number of comparisons on the full set (1,584), which makes the statistical test less powerful. Note also that the average score differences on the subset (columns 3 8 For each such template, we performed eleven score comparisons (ten paired noun phrases and one pair of averages from first name sentences).

Task
Avg. score diff.  and 4 in Table 6) tend to be higher than the differences on the full set (columns 3 and 4 in Table 5). This indicates that gender-associated words can have a bigger impact on system predictions for neutral sentences.
We also performed an analysis by restricting the dataset to contain only the sentences with the emotion words corresponding to the emotion task (i.e., submissions to the anger intensity prediction  task were evaluated only on sentences with anger words). The results (not shown here) were similar to the results on the full set.

Race Bias Results
We did a similar analysis for race as we did for gender. For each submission on each task, we calculated the difference between the average predicted score on the set of sentences with African American (AA) names and the average predicted score on the set of sentences with European American (EA) names. Then, we aggregated the results over all such sentence pairs in the EEC. Table 7 shows the results. The table has the same form and structure as the gender result tables. Observe that the number of submissions with no statistically significant score difference for sentences pertaining to the two races is about 5-11 (about 11% to 24%) for the four emotions and 3 (about 8%) for valence. These numbers are even lower than what was found for gender.
The majority of the systems assigned higher scores to sentences with African American names on the tasks of anger, fear, and sadness intensity prediction. On the joy and valence tasks, most submissions tended to assign higher scores to sen-tences with European American names. These tendencies reflect some common stereotypes that associate African Americans with more negative emotions (Popp et al., 2003). Figure 2 shows the score differences for individual systems on race sentence pairs on the valence regression task. Plots for the four emotion intensity prediction tasks are shown in Figure 4 in the Appendix. Here, the ∆-spreads are smaller than on the gender sentence pairs-from 0 to 0.15. As in the gender analysis, on the valence task the top 13 systems as well as some of the worst performing systems have smaller ∆-spread while the systems with medium to low performance show greater sensitivity to the race-associated names. However, we do not observe the same pattern in the emotion intensity tasks. Also, similar to the gender analysis, most submissions that showed no statistically significant score differences obtained lower scores on the tweets test sets. Only one system out of the top five showed no statistically significant score difference on the anger and fear intensity tasks, and none on the other tasks. Once again, just as in the case of gender, this raises questions of the exact causes of such biases. We hope to explore this in future work.

Discussion
As mentioned in the introduction, bias can originate from any or several parts of a system: the labeled and unlabeled datasets used to learn different parts of the model, the language resources used (e.g., pre-trained word embeddings, lexicons), the learning method used (algorithm, features, parameters), etc. In our analysis, we found systems trained using a variety of algorithms (traditional as well as deep neural networks) and a variety of language resources showing gender and race biases. Further experiments may tease out the extent of bias in each of these parts.
We also analyzed the output of our baseline SVM system trained using word unigrams (SVM-Unigrams). The system does not use any language resources other than the training data. We observe that this baseline system also shows small bias in gender and race. The ∆-spreads for this system were quite small: 0.09 to 0.2 on the gender sentence pairs and less than 0.002 on the race sentence pairs. The predicted intensity scores tended to be higher on the sentences with male noun phrases than on the sentences with female noun Figure 2: Analysis of race bias: Box plot of the score differences on the race sentence pairs for each system on the valence regression task. Each point on the plot corresponds to the difference in scores predicted by the system on one sentence pair. L represents AA↑-EA↓ significant group, M represents AA↓-EA↑ significant group, and G represents AA=EA not significant group. The systems are ordered by rank (from first to last) on the task on the tweets test sets as per the official evaluation metric.
phrases for the tasks of anger, fear, and sadness intensity prediction. This tendency was reversed on the task of valence prediction. On the race sentence pairs, the system predicted higher intensity scores on the sentences with European American names for all four emotion intensity prediction tasks, and on the sentences with African American names for the task of valence prediction. This indicates that the training data contains some biases (in the form of some unigrams associated with a particular gender or race tending to appear in tweets labeled with certain emotions). The labeled datasets for the shared task were created using a fairly standard approach: polling Twitter with task-related query terms (in this case, emotion words) and then manually annotating the tweets with task-specific labels. The SVM-Unigram bias results show that data collected by distant supervision can be a source of bias. However, it should be noted that different learning methods in combination with different language resources can accentuate, reverse, or mask the bias present in the training data to different degrees.

Conclusions and Future Work
We created the Equity Evaluation Corpus (EEC), which consists of 8,640 sentences specifically chosen to tease out gender and race biases in natural language processing systems. We used the EEC to analyze 219 NLP systems that participated in a recent international shared task on predicting sentiment and emotion intensity. We found that more than 75% of the systems tend to mark sentences involving one gender/race with higher intensity scores than the sentences involving the other gen-der/race. We found such biases to be more widely prevalent for race than for gender. We also found that the bias can be different depending on the particular affect dimension involved.
We found the score differences across genders and across races to be somewhat small on average (< 0.03, which is 3% of the 0 to 1 score range). However, for some systems the score differences reached as high as 0.34 (34%). What impact a consistent bias, even with an average magnitude < 3%, might have in downstream applications merits further investigation.
We plan to extend the EEC with sentences associated with country names, professions (e.g., doctors, police officers, janitors, teachers, etc.), fields of study (e.g., arts vs. sciences), as well as races (e.g., Asian, mixed, etc.) and genders (e.g., agender, androgyne, trans, queer, etc.) not included in the current study. We can then use the corpus to examine biases across each of those variables as well. We are also interested in exploring which systems (or what techniques) accentuate inappropriate biases in the data and which systems mitigate such biases. Finally, we are interested in exploring how the quality of sentiment analysis predictions varies when applied to text produced by different demographic groups, such as people of different races, genders, and ethnicities.
The Equity Evaluation Corpus and the proposed methodology to examine bias are not meant to be comprehensive. However, using several approaches and datasets such as the one proposed here can bring about a more thorough examination of inappropriate biases in modern machine learning systems.