Identifying and Measuring Annotator Bias Based on Annotators’ Demographic Characteristics

Machine learning is increasingly used to detect hate speech and other forms of abusive language on online platforms. However, a notable weakness of machine learning models is their vulnerability to bias, which can impair their performance and fairness. One such type is annotator bias, caused by the subjective perception of the annotators. In this work, we investigate annotator bias using classification models trained on data from demographically distinct annotator groups. To do so, we sample balanced subsets of data that are labeled by demographically distinct annotators. We then train classifiers on these subsets, analyze their performances on similarly grouped test sets, and compare them statistically. Our findings show that the proposed approach successfully identifies bias and that demographic features, such as first language, age, and education, correlate with significant performance differences.


Introduction
According to the online harassment report published by the Pew Research Center, "four-in-ten Americans have personally experienced online harassment, and 62% consider it a major issue" (Duggan, 2017, p. 3). Online environments such as social media and discussion forums have created spaces for people to express their opinions and viewpoints, but this comes at the cost of hateful, offensive, and abusive content. Moderating this content manually requires large teams and extensive hand-curated policies, which has generated much interest in automatic content moderation systems that make use of recent advances in machine learning (Schmidt and Wiegand, 2017).
One challenge of training machine learning systems is the demand for large amounts of labeled data. Hence, many researchers use crowdsourcing platforms to annotate their datasets (Founta et al., 2018; Vidgen and Derczynski, 2020), although having expert annotators has been shown to improve the quality of annotations (Waseem, 2016). Such crowdsourcing approaches, however, expose hate speech detection systems to annotator bias. Hateful behavior can take many forms (Waseem et al., 2017), making it harder to obtain a clean, common definition of hate speech and resulting in subjective and biased annotations. Biases in the annotations are then absorbed and reinforced by the machine learning models, causing systematically unfair systems (Bender and Friedman, 2018). Therefore, it is not surprising that a large body of work has sought to identify and mitigate this bias (Bender and Friedman, 2018; Bountouridis et al., 2019; Dixon et al., 2018).
We already know that people with particular demographic characteristics (e.g., black, disabled, or younger people) are more frequently the targets of hate (Vidgen et al., 2019b). An aspect that is sparsely investigated in this context is the relation between annotators' demographic features and a potential bias in the dataset. We want to fill this gap by addressing the following research question: How do annotators' demographic features, such as gender, age, education, and first language, impact their annotations of hateful content?
To answer this question, we conduct the following exploratory study: We sample balanced subsets of data that are labeled by demographically distinct annotators. We then train classifiers on these subsets, analyze their performances on similarly split test sets, and compare them statistically.


Related Work
Since biases can impair the performance and fairness of hate speech detection systems (Vidgen et al., 2019a; Dixon et al., 2018), a lot of recent work has been done to investigate this phenomenon (Wiegand et al., 2019; Kim et al., 2020).
Some work examined racial bias (Sap et al., 2019; Davidson et al., 2019; Xia et al., 2020), while others explored gender bias (Gold and Zesch, 2018), aggregation bias (Balayn et al., 2018), and political bias (Wich et al., 2020b). The type of bias we examine in this study is annotator bias. Waseem (2016) studied the influence of annotator expertise on classification models and found that systems trained on expert annotations outperform those trained on amateur annotations, confirming and extending the results of Ross et al. (2017). Geva et al. (2019) showed that model performance improves when the model is exposed to annotator identifiers, which suggests that annotator bias needs to be considered when creating hate speech models. Salminen et al. (2018) studied the differences between annotations of crowd workers from 50 countries and found them highly significant. Binns et al. (2017) examined the effect of the annotators' gender on the performance of classifiers. Wich et al. (2020a) studied similarities in the behavior of annotators to reveal the biases they bring into the data.
To the best of our knowledge, no one has developed a method to identify annotator bias based on multiple demographic characteristics of the annotators and to measure its impact on classification performance.

Data
We used the personal attack corpus from Wikipedia's Detox project (Wulczyn et al., 2017), which contains 115,864 labeled comments from Wikipedia, each annotated for whether it contains a form of personal attack. The labels are the following (Wikimedia, n.d.):
• Quoting attack: Indicator for whether the annotator thought the comment is quoting or reporting a personal attack that originated in a different comment.
• Recipient attack: Indicator for whether the annotator thought the comment contains a personal attack directed at the recipient of the comment.
• Third party attack: Indicator for whether the annotator thought the comment contains a personal attack directed at a third party.
• Other attack: Indicator for whether the annotator thought the comment contains a personal attack that is not a quoting, recipient, or third-party attack.
• Attack: Indicator for whether the annotator thought the comment contains any form of personal attack.
For our study, we used the attack label as the classification target and did not take the other labels into consideration.
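To make this concrete, the following minimal sketch loads the corpus and derives the binary attack label by majority vote over each comment's annotators. The file names follow the public figshare release of the Detox data and are an assumption here; adjust paths to your local copy.

```python
# Minimal sketch: load the Detox personal-attack corpus and derive a
# binary "attack" label per comment by majority vote over its annotators.
# File names are assumed from the public figshare release.
import pandas as pd

comments = pd.read_csv("attack_annotated_comments.tsv", sep="\t",
                       index_col="rev_id")
annotations = pd.read_csv("attack_annotations.tsv", sep="\t")

# Each row of `annotations` is one worker's judgment of one comment;
# we keep only the aggregate "attack" column and ignore the sub-labels.
comments["attack"] = annotations.groupby("rev_id")["attack"].mean() > 0.5
print(comments["attack"].value_counts())
```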

Methodology
The hypothesis underlying our approach is that a statistically significant difference between the classifiers' performances indicates an annotator bias related to the studied demographic feature.
In the first step, we group the annotators by their demographic features, such as gender, age, education level, and native language. For each of these features, we create m + 1 datasets, where m is the number of different values a demographic feature can take; for gender, for example, m would be 2 if we only consider male and female annotators. All datasets contain the same comments but differ in their labels, which are aggregated from the annotators belonging to each group. The additional (m+1)-th dataset has labels aggregated from annotators belonging to all groups and serves as a control group; we call it the mixed dataset. We measure the inter-rater agreement within each group using Krippendorff's alpha (Hayes and Krippendorff, 2007).
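The grouping step and the agreement computation can be sketched as follows. The demographics file name and its column values are assumptions based on the public release of the corpus, and we use the krippendorff Python package for the agreement score.

```python
# Sketch of the grouping step for one demographic feature (here: gender):
# m per-group datasets plus one mixed control dataset, all over the same
# comments, each labeled by majority vote within one annotator group.
import pandas as pd
import krippendorff  # pip install krippendorff

workers = pd.read_csv("attack_worker_demographics.tsv", sep="\t")
annotations = pd.read_csv("attack_annotations.tsv", sep="\t")
ann = annotations.merge(workers, on="worker_id")

def group_labels(ann, feature, value=None):
    """Majority-vote labels from annotators with feature == value;
    value=None keeps all annotators and yields the mixed dataset."""
    subset = ann if value is None else ann[ann[feature] == value]
    return (subset.groupby("rev_id")["attack"].mean() > 0.5).astype(int)

datasets = {value: group_labels(ann, "gender", value)
            for value in ["male", "female"]}
datasets["mixed"] = group_labels(ann, "gender")

# Inter-rater agreement within one group: a worker x comment matrix with
# NaN for unannotated cells, passed to Krippendorff's alpha (nominal data).
male = ann[ann["gender"] == "male"]
matrix = male.pivot_table(index="worker_id", columns="rev_id", values="attack")
alpha = krippendorff.alpha(reliability_data=matrix.values,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha (male annotators): {alpha:.2f}")
```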
In the second step, we split the datasets into train and test sets, train 20 classifiers for each group on the group's training set, and report F1 scores on all test sets. We train 20 classifiers to obtain multiple data points for each group's classifier and then apply the Kolmogorov-Smirnov test to examine whether the resulting score distributions differ significantly (we limit ourselves to 20 classifiers per group for practical reasons). The null hypothesis in this context is that the two samples are drawn from the same distribution. If we can reject the null hypothesis (p < 0.05) for a certain demographic feature, we take this as evidence that annotators belonging to different groups hold different norms and bring different biases into their annotations.
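The comparison is a two-sample Kolmogorov-Smirnov test, as in the following minimal sketch; the score arrays here are randomly generated placeholders standing in for the 20 F1 values per group.

```python
# Minimal sketch of the significance test between two groups' F1 scores.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
f1_group_a = rng.normal(0.80, 0.01, size=20)  # placeholder: 20 runs, group A
f1_group_b = rng.normal(0.77, 0.01, size=20)  # placeholder: 20 runs, group B

stat, p_value = ks_2samp(f1_group_a, f1_group_b)
print(f"KS statistic = {stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the score distributions differ.")
```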
Concerning the classification model, we chose to make use of recent advances in transfer learning and employ DistilBERT as a classifier due to the limited number of data points annotated by each group. DistilBERT (Sanh et al., 2019) is a smaller and faster distilled version of BERT (Devlin et al., 2018); in the context of abusive language detection, it has been shown to provide comparable performance. We used the base uncased version of DistilBERT (distilbert-base-uncased) with a maximum sequence length of 100, a learning rate of 5 × 10^-6, and the 1cycle learning rate policy (Smith, 2018), and trained each classifier for 2 epochs.
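A simplified sketch of one classifier's training setup with these hyperparameters follows; the batch size and the toy training examples are assumptions, since they are not fixed by the description above.

```python
# Simplified sketch: distilbert-base-uncased, max length 100, lr 5e-6,
# 1cycle schedule, 2 epochs. Batch size and examples are placeholders.
import torch
from torch.optim.lr_scheduler import OneCycleLR
from transformers import (DistilBertForSequenceClassification,
                          DistilBertTokenizerFast)

train_texts = ["you are an idiot", "thanks for the helpful edit"]  # placeholders
train_labels = [1, 0]

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

enc = tokenizer(train_texts, truncation=True, max_length=100,
                padding=True, return_tensors="pt")
labels = torch.tensor(train_labels)

epochs, batch_size = 2, 16  # batch size is an assumption
steps_per_epoch = (len(labels) + batch_size - 1) // batch_size
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
scheduler = OneCycleLR(optimizer, max_lr=5e-6,
                       total_steps=epochs * steps_per_epoch)

model.train()
for _ in range(epochs):
    for i in range(0, len(labels), batch_size):
        batch = {k: v[i:i + batch_size] for k, v in enc.items()}
        out = model(**batch, labels=labels[i:i + batch_size])
        out.loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```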

Data split
To ensure the comparability of the classifiers, it is necessary to compile the training and test sets in the right way. Therefore, we define the following two conditions for selecting the comments: (1) All datasets of one feature contain the same comments. (2) At least 6 annotators from each demographic group annotated the comment. In the case of gender, this means a selected comment was annotated by at least 6 male and 6 female annotators.
For each demographic feature, we create 3 training and test set combinations. In the first one, the labels are taken from a random set of 6 annotators belonging to the first demographic group (e.g., males). In the second one, the labels are taken from a random set of 6 annotators belonging to the second demographic group (e.g., females). The third train and test sets are mixed: the labels are taken from a random set of 3 annotators belonging to the first demographic group and 3 annotators belonging to the second. While the subset of comments stays unchanged, we sample the annotations of different random annotators for each of the 20 classifiers. The sizes of the datasets can be found in Table 1.
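This sampling procedure can be sketched as follows, reusing the merged annotation/demographics frame ann from the earlier sketch; the helper names and the tie-breaking rule (a 3-3 vote counts as non-attack) are our assumptions.

```python
# Sketch of the label-sampling step for one of the 20 runs.
import pandas as pd

def eligible_comments(ann, feature, values, min_annotators=6):
    """rev_ids annotated by at least `min_annotators` workers per group."""
    counts = ann.groupby(["rev_id", feature])["worker_id"].nunique().unstack()
    return counts.index[(counts[values] >= min_annotators).all(axis=1)]

def sample_labels(ann, rev_ids, feature, quota):
    """Majority vote over randomly drawn annotators per comment.
    quota: {"male": 6} for a pure set, {"male": 3, "female": 3} for mixed."""
    labels = {}
    for rev_id in rev_ids:
        rows = ann[ann["rev_id"] == rev_id]
        votes = pd.concat(
            rows[rows[feature] == value].sample(n=n)
            for value, n in quota.items()
        )["attack"]
        labels[rev_id] = int(votes.mean() > 0.5)  # ties count as non-attack
    return labels

rev_ids = eligible_comments(ann, "gender", ["male", "female"])
male_labels = sample_labels(ann, rev_ids, "gender", {"male": 6})
mixed_labels = sample_labels(ann, rev_ids, "gender", {"male": 3, "female": 3})
```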
We also performed the same experiments without the restriction that the datasets of each feature share the same comments, in order to increase the number of comments in the splits. The results were very similar to those of the shared-comments experiments.

Results
In this section, we report the results of our experiments for each demographic feature. The results comprise the inter-rater agreement of the annotators in the different groups, the averaged F1 scores of the trained classifiers, the sensitivity and specificity of the classifiers as charts, and the p-values generated by the Kolmogorov-Smirnov tests.
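For reference, the reported metrics for one classifier on one test set can be computed as in the following sketch; the prediction arrays are placeholders.

```python
# Sketch of the reported per-classifier metrics on one test set.
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 0]  # placeholder gold labels
y_pred = [1, 0, 0, 1, 0, 1]  # placeholder predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"F1:          {f1_score(y_true, y_pred):.2f}")
print(f"Sensitivity: {tp / (tp + fn):.2f}")  # true positive rate
print(f"Specificity: {tn / (tn + fp):.2f}")  # true negative rate
```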

Gender
With regard to gender, we could not find evidence of any significant difference between the male and female classifiers. Although the inter-rater agreement is notably lower for females (0.45) than for males (0.51) (Table 4), the average F1 scores of the 20 classifiers trained for each group show no significant difference (Table 2). The sensitivity and specificity graphs in Figure 1a likewise show no significant pattern or trend. The p-value resulting from the Kolmogorov-Smirnov test applied to the F1 scores of the 20 male and 20 female classifiers evaluated on the mixed test set is 0.83 (Table 3). Since it is larger than 0.05, we cannot conclude that a significant difference between the male and female classifiers exists.

First Language
Our experiments on first-language classifiers resulted in the following observations: 1. Classifiers trained on native-labeled data have a notably higher F1 score (Table 2) and also show higher sensitivity on all test sets (the blue triangles in Figure 1b), which suggests that they are particularly better at classifying comments that contain a personal attack.
2. Classifiers trained on only non-native-labeled data perform almost as well as the baseline (a classifier trained on mix-labeled data) (Table 2).
3. We found only minor disparities in the specificity of the two classifiers (Figure 1b).
The Kolmogorov-Smirnov test on native and non-native classifiers yields a p-value of 1.0 × 10^-3 (Table 3); thus we can reject the null hypothesis and conclude that a significant difference does exist between them.

Age group
Our experiments resulted in the following observations: 1. Classifiers trained on over-30-labeled data have higher F1 scores than classifiers trained on under-30-labeled data on all test sets. They are, however, comparable to the baseline (a classifier trained on mix-labeled data) (Table 2).
2. All classifiers are less sensitive to the over-30-labeled test set (Figure 1c), which might suggest that it contains harder examples that all classifiers failed to classify correctly.
The Kolmogorov-Smirnov test on the results of the two classifiers produces a p-value of 1.1 × 10^-8 (Table 3); thus we can reject the hypothesis that they come from the same distribution and conclude that a significant difference does exist between them.

Education
Our experiments resulted in the following observations: 1. The F1 scores of the classifiers trained on below-hs-labeled data (hs: high school) are higher than those of classifiers trained on above-hs-labeled data on all test sets (Table 2).
2. Classifiers trained on below-hs-labeled data have a specificity comparable to the other classifiers but a notably higher sensitivity on all test sets (Figure 1d).
The Kolmogorov-Smirnov test, with a p-value of 1.4 × 10^-7 (Table 3), also shows that a significant difference exists between the two groups.

Discussion
In light of our results, we conclude that the gender of the annotator does not introduce a significant bias into annotations of personal attacks in the studied dataset. However, when Binns et al. (2017) explored the role of gender in offensive content annotations, they found a distinguishable difference between males and females. We think this is related to the nature of the annotation task itself. In future work, our approach can be applied to the other datasets provided by Wikipedia's Detox project (Wulczyn et al., 2017), such as aggressiveness and toxicity, to investigate the effect of gender on those tasks.
When it comes to the first language of the annotators, native English speakers seem to be generally better at identifying personal attacks in comments. The results also suggest that non-natives often failed to recognize attacks in comments that natives labeled as containing one.
In addition, age groups and education levels of the annotators also seem to play a notable role in how attacks are perceived. Training a classifier on aggregated labels from all groups, even if the data is balanced between groups, does not seem to be fair to all groups involved.
Although we have only explored the demographic features provided by the dataset and grouped some of them for reasons dictated by the data size, we think that other features (e.g., race, ethnicity, and political orientation), different within-feature groupings, and feature intersections might reveal further biases. While exploring all possible demographic features prior to building models is simply infeasible, the set of studied features can be determined per task.
Our approach demonstrated how particular training sets labeled by different groups of people can be used to identify and measure bias in datasets. These biases are never constant or static, even within one group, for what counts as hateful is always subjective. In consequence, having only one version of ground truth is bound to produce biased systems. Training models on biased datasets inevitably produces systems that amplify those biases, whether they are exclusionary, prejudicial, or historical. Therefore, and due to the conflicting and ever-changing definitions of hate speech among communities, we urge researchers in the hate speech domain to examine their datasets closely and thoroughly in order to understand their limitations and consequences.

Conclusion
This work explored bias in hate speech classification models, where the task is inherently controversial and annotators' demographic data might influence the labels. We demonstrated how particular demographic features might bias the models in ways that are important to examine prior to using such models in production. We explored the performance of classification models trained and tested on different train and test data splits in order to assess the fairness of these classifiers and the biases they absorb. We hope that our proposed method for identifying and measuring annotator bias based on annotators' demographic characteristics will help to build fairer hate speech classifiers.