Racial Bias in Hate Speech and Abusive Language Detection Datasets

Technologies for abusive language detection are being developed and applied with little consideration of their potential biases. We examine racial bias in five different sets of Twitter data annotated for hate speech and abusive language. We train classifiers on these datasets and compare the predictions of these classifiers on tweets written in African-American English with those written in Standard American English. The results show evidence of systematic racial bias in all datasets, as classifiers trained on them tend to predict that tweets written in African-American English are abusive at substantially higher rates. If these abusive language detection systems are used in the field they will therefore have a disproportionate negative impact on African-American social media users. Consequently, these systems may discriminate against the groups who are often the targets of the abuse we are trying to detect.


Introduction
Recent work has shown evidence of substantial bias in machine learning systems, which is typically a result of bias in the training data. This includes both supervised (Blodgett and O'Connor, 2017; Tatman, 2017; Kiritchenko and Mohammad, 2018; De-Arteaga et al., 2019) and unsupervised natural language processing systems (Bolukbasi et al., 2016; Caliskan et al., 2017; Garg et al., 2018). Machine learning models are currently being deployed in the field to detect hate speech and abusive language on social media platforms including Facebook, Instagram, and YouTube. The aim of these models is to identify abusive language that directly targets certain individuals or groups, particularly people belonging to protected categories (Waseem et al., 2017). Bias may reduce the accuracy of these models, and at worst, will mean that the models actively discriminate against the same groups they are designed to protect.
Our study focuses on racial bias in hate speech and abusive language detection datasets (Waseem, 2016; Waseem and Hovy, 2016; Golbeck et al., 2017; Founta et al., 2018), all of which use data collected from Twitter. We train classifiers using each of the datasets and use a corpus of tweets with demographic information to compare how each classifier performs on tweets written in African-American English (AAE) versus Standard American English (SAE) (Blodgett et al., 2016). We use bootstrap sampling (Efron and Tibshirani, 1986) to estimate the proportion of tweets in each group that each classifier assigns to each class. We find evidence of systematic racial biases across all of the classifiers, with AAE tweets predicted as belonging to negative classes like hate speech or harassment significantly more frequently than SAE tweets. In most cases the bias decreases in magnitude when we condition on particular keywords which may indicate membership in negative classes, yet it still persists. We expect that these biases will result in racial discrimination if classifiers trained on any of these datasets are deployed in the field.

Related works
Scholars and practitioners have recently been devoting more attention to bias in machine learning models, particularly as these models are becoming involved in more and more consequential decisions (Athey, 2017). Bias often derives from the data used to train these models. For example, Buolamwini and Gebru (2018) show how facial recognition technologies perform worse for darker-skinned people, particularly darker-skinned women, due to the disproportionate presence of white, male faces in the training data. Natural language processing systems also inherit biases from the data they were trained on. For example, in unsupervised learning, word embeddings often contain biases (Bolukbasi et al., 2016; Caliskan et al., 2017; Garg et al., 2018) which persist even after attempts to remove them (Gonen and Goldberg, 2019). There are many examples of bias in supervised learning contexts: YouTube's captioning models make more errors when transcribing women (Tatman, 2017), AAE is more likely to be misclassified as non-English by widely used language classifiers (Blodgett and O'Connor, 2017), numerous gender and racial biases exist in sentiment classification systems (Kiritchenko and Mohammad, 2018), and errors in both co-reference resolution systems and occupational classification models reflect gendered occupational patterns (Zhao et al., 2018; De-Arteaga et al., 2019).
While hate speech and abusive language detection has become an important area for natural language processing research (Schmidt and Wiegand, 2017; Waseem et al., 2017; Fortuna and Nunes, 2018), there has been little work addressing the potential for these systems to be biased. The danger posed by bias in such systems is, however, particularly acute, since it could result in negative impacts on the same populations the systems are designed to protect. For example, if we mistakenly consider speech by a targeted minority group as abusive we might unfairly penalize the victim, but if we fail to identify abuse against them we will be unable to take action against the perpetrator. Although no model can perfectly avoid such problems, we should be particularly concerned about the potential for such models to be systematically biased against certain social groups, particularly protected classes.
A number of studies have shown that false positive cases of hate speech are associated with the presence of terms related to race, gender, and sexuality (Kwok and Wang, 2013; Burnap and Williams, 2015). While not directly measuring bias, prior work has explored how annotation schemes and the identity of the annotators (Waseem, 2016) might be manipulated to help to avoid bias. Dixon et al. (2018) directly measured biases in the Google Perspective API classifier, trained on data from Wikipedia talk comments, finding that it tended to give high toxicity scores to innocuous statements like "I am a gay man". They called this "false positive bias", caused by the model overgeneralizing from the training data, in this case from examples where "gay" was used pejoratively. They find that a number of such "identity terms" are disproportionately represented in the examples labeled as toxic. Park et al. (2018) build upon this study, using templates to study gender differences in performance across two hate speech and abusive language detection datasets. They find that classifiers trained on these data tend to perform worse when female identity terms are used, indicating gender bias in performance. We build upon this work by auditing a series of abusive language and hate speech detection datasets for racial biases. We evaluate how classification models trained on these datasets perform in the field, comparing their predictions for tweets written in language used by whites or African-Americans.

Hate speech and abusive language datasets
We focus on Twitter, the most widely used data source in abusive language research. We use all available datasets where tweets are labeled as various types of abuse and are written in English. We now briefly describe each of these datasets in chronological order.
Waseem and Hovy (2016) collected 130k tweets containing one of seventeen different terms or phrases they considered to be hateful. They then annotated a sample of these tweets themselves, using guidelines inspired by critical race theory. These annotations were then reviewed by "a 25 year old woman studying gender studies and a nonactivist feminist" to check for bias. This dataset consists of 16,849 tweets labeled as either racism, sexism, or neither. Most of the tweets categorized as sexist relate to debates over an Australian TV show and most of those considered as racist are anti-Muslim.
To account for potential bias in the previous dataset, Waseem (2016) relabeled 2,876 tweets in the dataset, along with a new sample from the tweets originally collected. The tweets were annotated by "feminist and anti-racism activists", based upon the assumption that they are domain-experts. A fourth category, racism and sexism, was also added to account for the presence of tweets which exhibit both types of abuse. The dataset contains 6,909 tweets.
 collected tweets containing terms from the Hatebase, a crowdsourced hate speech lexicon, then had a sample coded by crowdworkers located in the United States. To avoid false positives that occurred in prior work which considered all uses of particular terms as hate speech, crowdworkers were instructed not to make their decisions based upon any words or phrases in particular, no matter how offensive, but on the overall tweet and the inferred context. The dataset consists of 24,783 tweets annotated as hate speech, offensive language, or neither.
Golbeck et al. (2017) selected tweets using ten keywords and phrases related to anti-black racism, Islamophobia, homophobia, anti-semitism, and sexism. The authors developed a coding scheme to distinguish between potentially offensive content and serious harassment, such as threats or hate speech. After an initial round of coding, where tweets were assigned to a number of different categories, they simplified their analysis to include a binary harassment or non-harassment label for each tweet. The dataset consists of 20,360 tweets, each hand-labeled by the authors.
Founta et al. (2018) constructed a dataset intended to better approximate a real-world setting where abuse is relatively rare. They began with a random sample of tweets then augmented it by adding tweets that contained one or more terms from the Hatebase lexicon and had negative sentiment. They criticized prior work for defining labels in an ad hoc manner. To develop a more comprehensive annotation scheme they initially labeled a sample of tweets, allowing each tweet to belong to multiple classes. After analyzing the overlap between different classes they settled on a coding scheme with four distinct classes: abusive, hateful, spam, and normal. We use a dataset they published containing 91,951 tweets coded into these categories by crowdworkers.

Training classifiers
For each dataset we train a classifier to predict the class of unseen tweets. We use regularized logistic regression with bag-of-words features, a commonly used approach in the field. While we expect that we could improve predictive performance by using more sophisticated classifiers, we expect that any bias is likely a function of the training data itself rather than the classifier. Moreover, although features like word embeddings can work well for this task (Djuric et al., 2015) we wanted to avoid inducing any bias in our models by using pre-trained embeddings (Park et al., 2018). We pre-process each tweet by removing excess white-space and replacing URLs and mentions with placeholders. We then tokenize them, stem each token, and construct n-grams with a maximum length of three. Next we transform each dataset into a TF-IDF matrix, with a maximum of 10,000 features. We use 80% of each dataset to train models and hold out the remainder for validation. Each model is trained using stratified 5-fold cross-validation. We conduct a grid-search over different regularization strength parameters to identify the best performing model. Finally, for each dataset we identify the model with the best average F1 score and retrain it using all of the training data. The performance of these models on the 20% held-out validation data is reported in Table 1. Overall we see varying performance across the classifiers, with some performing much better out-of-sample than others. In particular, we see that hate speech and harassment are particularly difficult to detect. Since we are primarily interested in within-classifier, between-corpora performance, any variation between classifiers should not impact our results.
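The pre-processing and n-gram steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the placeholder strings `URLHERE` and `MENTIONHERE`, the regular expressions, and the function names are assumptions, and the stemming step is omitted because the text does not say which stemmer was used.

```python
import re

# Hypothetical patterns and placeholders; the paper does not specify them.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[A-Za-z0-9_']+")


def preprocess(tweet):
    """Replace URLs and mentions with placeholders and collapse whitespace."""
    text = URL_RE.sub("URLHERE", tweet)
    text = MENTION_RE.sub("MENTIONHERE", text)
    return re.sub(r"\s+", " ", text).strip()


def ngrams(tweet, max_n=3):
    """Return all word n-grams of length 1..max_n for a pre-processed tweet.

    Stemming is skipped here; a stemmer would be applied to each token
    before the n-grams are built.
    """
    tokens = TOKEN_RE.findall(preprocess(tweet).lower())
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams
```

The resulting n-gram lists would then be fed to a TF-IDF vectorizer capped at 10,000 features and a regularized logistic regression, for which a standard machine learning library is the usual choice.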

Race dataset
We use a dataset of tweets labeled by race from Blodgett et al. (2016) to measure racial biases in these classifiers. They collected geolocated tweets in the U.S. and matched them with demographic data from the Census on the population of non-Hispanic whites, non-Hispanic blacks, Hispanics, and Asians in the block group where the tweets originated. They then identified words associated with particular demographics and trained a probabilistic mixed-membership language model. This model learns demographically-aligned language models for each of the four demographic categories and is used to calculate the posterior proportion of language from each category in each tweet. Their validation analyses indicate that tweets with a high posterior proportion of non-Hispanic black language exhibit lexical, phonological, and syntactic variation consistent with prior research on AAE. Their publicly available dataset contains 59.2 million tweets.
We define a user as likely non-Hispanic black if the average posterior proportion across all of their tweets for the non-Hispanic black language model is ≥ 0.80 (and ≤ 0.10 Hispanic and Asian combined) and as non-Hispanic white using the same formula but for the white language model. 5 This allows us to restrict our analysis to tweets written by users who predominantly use one of the language models. Due to space constraints we discard users who predominantly use either the Hispanic or the Asian language model. This results in a set of 1.1m tweets written by people who generally use non-Hispanic black language and 14.5m tweets written by users who tend to use non-Hispanic white language. Following Blodgett and O'Connor (2017), we call these datasets black-aligned and white-aligned tweets, reflecting the fact that they contain language associated with either demographic category but which may not all be produced by members of these categories. We now describe how we use these data in our experiments.
5 We use this threshold following Blodgett and O'Connor (2017) and after consulting with the lead author. While these cut-offs should provide high confidence that the users tend to use AAE or SAE, and hence serve as a proxy for race, it is important to note that not all African-Americans use AAE and that not all AAE users are African-American, although use of the AAE dialect suggests a social proximity to or affinity for African-American communities (Blodgett et al., 2016).
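The user-level filter defined above can be sketched as follows. The data layout (one dictionary of posterior proportions per tweet, with the keys `black`, `white`, `hispanic`, and `asian`) and the function name are hypothetical; only the 0.80 and 0.10 thresholds come from the text.

```python
from statistics import mean

GROUPS = ("black", "white", "hispanic", "asian")  # hypothetical key names


def label_user(tweet_posteriors, high=0.80, low=0.10):
    """Assign a user to the black-aligned or white-aligned set, or to neither.

    tweet_posteriors: one dict per tweet, mapping each group in GROUPS to
    the posterior proportion of that tweet's language.
    """
    avg = {g: mean(p[g] for p in tweet_posteriors) for g in GROUPS}
    # Hispanic and Asian language combined must stay at or below the cap.
    if avg["hispanic"] + avg["asian"] > low:
        return None
    if avg["black"] >= high:
        return "black-aligned"
    if avg["white"] >= high:
        return "white-aligned"
    return None  # the user does not predominantly use either model
```

Averaging over all of a user's tweets before thresholding means that a single atypical tweet does not change the user's assignment.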

Experiments
We examine whether the probability that a tweet is predicted to belong to a particular class varies in relation to the racial alignment of the language it uses. The null hypothesis of no racial bias is that the probability a tweet will belong to a negative class is independent of the racial group the tweet's author is a member of. Formally, for class $c_i$, where $c_i = 1$ denotes membership in the class and $c_i = 0$ the opposite, we aim to test $H_N: P(c_i = 1 \mid \text{black}) = P(c_i = 1 \mid \text{white})$. If $P(c_i = 1 \mid \text{black}) > P(c_i = 1 \mid \text{white})$ and the difference is statistically significant then we can reject the null hypothesis $H_N$ in favor of the alternative hypothesis $H_A$ that black-aligned tweets are classified into $c_i$ at a higher rate than white-aligned tweets. Conversely, if $P(c_i = 1 \mid \text{black}) < P(c_i = 1 \mid \text{white})$ we can conclude that the classifier is more likely to classify white-aligned tweets as $c_i$. We should expect that white-aligned tweets are more likely to use racist language or hate speech than black-aligned tweets, given that African-Americans are often targeted with racism and hate speech by whites. However, for some classes like sexism we have no reason to expect there to be racial differences in either direction.
To test this hypothesis we use bootstrap sampling (Efron and Tibshirani, 1986) to estimate the proportion of tweets in each dataset that each classifier predicts to belong to each class. We draw $n$ random samples with replacement of $k$ tweets from each of the two race corpora, where $n = k = 1000$. For each sample we use each classifier to predict the class membership of each tweet, then store the proportion of tweets that were assigned to each class, $p_i$. For each classifier-class pair, we thus obtain a pair of vectors, one for each corpus, each containing $n$ sampled proportions. The bootstrap estimates for the proportion of tweets belonging to class $i$ for each group, $p_{i,\text{black}}$ and $p_{i,\text{white}}$, are calculated by taking the mean of the elements in each vector: $\frac{1}{n} \sum_{j=1}^{n} p_{ij}$. We then use a t-test to test whether $p_{i,\text{black}} = p_{i,\text{white}}$. We also calculate the ratio $p_{i,\text{black}} / p_{i,\text{white}}$, where values greater than one indicate that black-aligned tweets are classified as belonging to class $i$ at a higher rate than white-aligned tweets.
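The bootstrap procedure above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: `predict` stands in for any trained classifier, the function names are hypothetical, and Welch's unequal-variance t statistic is used because the text does not say which form of t-test was applied.

```python
import random
from math import sqrt
from statistics import mean, variance


def bootstrap_proportions(corpus, predict, target, n=1000, k=1000, rng=None):
    """Draw n samples of k tweets with replacement; return the n proportions
    of tweets in each sample that `predict` assigns to class `target`."""
    if rng is None:
        rng = random.Random(0)
    proportions = []
    for _ in range(n):
        sample = rng.choices(corpus, k=k)  # sampling with replacement
        hits = sum(1 for tweet in sample if predict(tweet) == target)
        proportions.append(hits / k)
    return proportions


def welch_t(a, b):
    """Welch's t statistic for two vectors of sampled proportions (assumed
    choice of test). Requires at least one vector to have nonzero variance."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def compare(black_props, white_props):
    """Bootstrap estimates for both groups, their ratio, and the t statistic."""
    p_black, p_white = mean(black_props), mean(white_props)
    return p_black, p_white, p_black / p_white, welch_t(black_props, white_props)
```

A ratio above one from `compare` corresponds to black-aligned tweets being assigned to the class at a higher rate; the ratio is undefined when no white-aligned tweet is assigned to the class.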
We also conduct a second experiment, where we assess whether there is racial bias conditional upon a tweet containing a keyword likely to be associated with a negative class. While differences in language will undoubtedly remain, this should help to account for the possibility that results in Experiment 1 are driven by differences in the true distribution of the different classes of interest, or of words associated with these classes, in the two corpora. For classifier $c$ and category $i$, we evaluate $H_N: P(c_i = 1 \mid \text{black}, t) = P(c_i = 1 \mid \text{white}, t)$ for a given term $t$. We conduct this experiment for two different terms, each of which occurs frequently enough in the data to enable our bootstrapping approach. We select the term "n*gga", since it is a particularly prevalent source of false positives for hate speech detection (Kwok and Wang, 2013; Waseem et al., 2018). 6 In this case, we expect that tweets containing the word should be classified as more negative when used by whites, thus $H_{A1}: P(c_i = 1 \mid \text{black}, t) < P(c_i = 1 \mid \text{white}, t)$. The other alternative, $H_{A2}: P(c_i = 1 \mid \text{black}, t) > P(c_i = 1 \mid \text{white}, t)$, would indicate that black-aligned tweets containing the term are penalized at a higher rate than comparable white-aligned tweets. We also assess the results for the word "b*tch" since it is a widely used sexist term, which is often also used casually, but we have no theoretical reason to expect there to be racial differences in its usage. The term "n*gga" was used in around 2.25% of black-aligned and 0.15% of white-aligned tweets. The term "b*tch" was used in 1.7% of black-aligned and 0.5% of white-aligned tweets. The substantial differences in the distributions for these two terms alone are consistent with our intuition that some of the results in Experiment 1 may be driven by differences in the frequencies of words associated with negative classes in the training datasets. Since we are using a subsample of the available data, we use smaller bootstrap samples, drawing $k = 100$ tweets each time.
6 We also planned to conduct the same analysis using the "-er" suffix, however the sample was too small, with the word being used in 555 tweets in the white-aligned corpus (0.004%) and 61 in the black-aligned corpus (0.005%).
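Conditioning on a term, as in this second experiment, amounts to restricting each corpus to the tweets that contain the term before resampling. The sketch below is an assumption-laden illustration: the function names are hypothetical, and case-insensitive whole-word matching is one plausible rule, since the text does not state how term occurrences were identified.

```python
import re


def contains_term(tweet, term):
    """True if `term` occurs in the tweet as a whole word (assumed rule)."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, tweet, flags=re.IGNORECASE) is not None


def usage_rate(corpus, term):
    """Share of tweets in the corpus that contain the term."""
    return sum(1 for tweet in corpus if contains_term(tweet, term)) / len(corpus)


def condition_on_term(corpus, term):
    """Subset of the corpus used for the conditional comparison."""
    return [tweet for tweet in corpus if contains_term(tweet, term)]
```

The subsets returned by `condition_on_term` for the two corpora would then be resampled with the smaller sample size of k = 100, and the usage rates can be compared directly: 2.25% against 0.15% is a fifteen-fold difference.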

Results
The results of Experiment 1 are shown in Table 2. We observe substantial racial disparities in the performance of all classifiers. In all but one of the comparisons, there are statistically significant (p < 0.001) differences and in all but one of these we see that tweets in the black-aligned corpus are assigned negative labels more frequently than those by whites. The only case where black-aligned tweets are classified into a negative class less frequently than white-aligned tweets is the racism class in the Waseem and Hovy (2016) classifier. Note, however, the extremely low rate at which tweets are predicted to belong to this class for both groups. On the other hand, this classifier is 1.7 times more likely to classify tweets in the black-aligned corpus as sexist. For Waseem (2016) we see that there is no significant difference in the estimated rates at which tweets are classified as racist across groups, although the rates remain low. Tweets in the black-aligned corpus are classified as containing sexism almost twice as frequently and 1.1 times as frequently classified as containing racism and sexism compared to those in the white-aligned corpus. Moving onto , we find large disparities, with around 5% of tweets in the black-aligned corpus classified as hate speech compared to 2% of those in the white-aligned set. Similarly, 17% of black-aligned tweets are predicted to contain offensive language compared to 6.5% of white-aligned tweets. The classifier trained on the Golbeck et al. (2017) dataset predicts black-aligned tweets to be harassment 1.4 times as frequently as white-aligned tweets. The Founta et al. (2018) classifier labels around 11% of tweets in the black-aligned corpus as hate speech and almost 18% as abusive, compared to 6% and 8% of white-aligned tweets respectively. It also classifies black-aligned tweets as spam 1.8 times as frequently.
The results of Experiment 2 are consistent with the previous results, although there are some notable differences. In most cases the racial disparities persist, although they are generally smaller in magnitude and in some cases the direction even changes. Table 3 shows that for tweets containing the word "n*gga", classifiers trained on Waseem and Hovy (2016) and Waseem (2016) both predict black-aligned tweets to be instances of sexism approximately 1.5 times as often as white-aligned tweets. The classifier trained on the  data is significantly less likely to classify black-aligned tweets as hate speech, although it is more likely to classify them as offensive. The Golbeck et al. (2017) classifier labels tweets as harassment at a higher rate for both groups than in the previous experiment, although the disparity is narrower. For the Founta et al. (2018) classifier we see that black-aligned tweets are slightly less frequently considered to be hate speech but are much more frequently classified as abusive.
The results for the second variation of Experiment 2, where we conditioned on the word "b*tch", are shown in Table 4. We see similar results for Waseem and Hovy (2016) and Waseem (2016). In both cases the classifiers trained upon their data are still more likely to flag black-aligned tweets as sexism. The Waseem and Hovy (2016) classifier is particularly sensitive to the word "b*tch" with 96% of black-aligned and 94% of white-aligned tweets predicted to belong to this class. For  almost all of these tweets are classified as offensive; however, those in the black-aligned corpus are 1.15 times as frequently classified as hate speech. We see a very similar result for Golbeck et al. (2017) compared to the previous experiment, with black-aligned tweets flagged as harassment at 1.1 times the rate of those in the white-aligned corpus. Finally, for the Founta et al. (2018) classifier we see a substantial racial disparity, with black-aligned tweets classified as hate speech at 2.7 times the rate of white-aligned ones, a higher rate than in Experiment 1.

Discussion
Our results demonstrate consistent, systematic and substantial racial biases in classifiers trained on all five datasets. In almost every case, black-aligned tweets are classified as sexism, hate speech, harassment, and abuse at higher rates than white-aligned tweets. To some extent, the results in the first experiment may be driven by underlying differences in the rates at which speakers of different dialects use particular words and phrases associated with these negative classes in the training data. For example, the word "n*gga" appears fifteen times as frequently in the black-aligned corpus compared to the white-aligned corpus. However, the second experiment shows that these disparities tend to persist even when comparing tweets containing keywords likely to be associated with negative classes. While some of the remaining disparities are likely due to differences in the distributions of other keywords we did not condition on, we expect that other more innocuous aspects of black-aligned language may be associated with negative labels in the training data, leading classifiers to disproportionately predict that tweets by African-Americans belong to negative classes. We now discuss the results as they pertain to each of the datasets used.
Classifiers trained on data from Waseem and Hovy (2016) and Waseem (2016) only predicted a small fraction of the tweets to be racism. We suspect that this is due to the composition of their dataset, since the majority of the racist training examples consist of anti-Muslim rather than anti-black language. Across both datasets the words "n*gger" and "n*gga" appear in 4 and 10 tweets respectively. Looking at the sexism class on the other hand, we see that both models were consistently classifying tweets in the black-aligned corpus as sexism at a substantially higher rate than those in the white-aligned corpus. Given this result, and the gender biases identified in these data by Park et al. (2018), it is not apparent that the purportedly expert annotators were any less biased than amateur annotators (Waseem, 2016).
The classifier trained on  shows the largest disparities in Experiment 1, with tweets in the black-aligned corpus classified as hate speech and offensive language at substantially higher rates than white-aligned tweets. We expect that this result occurred for two reasons. First, the dataset contains a large number of cases where AAE is used (Waseem et al., 2018). Second, many of the AAE tweets also use words like "n*gga" and "b*tch", and are thus frequently associated with the hate speech and offensive classes, resulting in "false positive bias" (Dixon et al., 2018). On the other hand, the distinction between hate speech and offensive language appears to hold up to scrutiny: while a large proportion of tweets in Experiment 2 containing the word "n*gga" are classified as hate speech, the rate is substantially higher for white-aligned tweets. Without this category we expect that many of the tweets classified as offensive would instead be mistakenly classified as hate speech.
Turning to the Golbeck et al. (2017) classifier we found that tweets in the black-aligned dataset were significantly more likely to be classified as harassment in all experiments, although the disparity decreased substantially after conditioning on certain keywords. It seems likely that their simple binary labelling scheme may not be sufficient to capture the variation in language used, resulting in high rates of false positives.
Finally, Founta et al. (2018) is the largest and perhaps the most comprehensive of the available datasets. In Experiment 1 we see that this classifier has the second highest rates of racial disparities, classifying black-aligned tweets as hate speech, abusive, and spam at substantially higher rates than white-aligned tweets. In Experiment 2 the classifier is slightly less likely to classify black-aligned tweets containing the word "n*gga" as hate speech but is 2.7 times more likely to predict that black-aligned tweets using "b*tch" belong to this category.

Conclusion
Our study is the first to measure racial bias in hate speech and abusive language detection datasets. We find evidence of substantial racial bias in all of the datasets tested. This bias tends to persist even when comparing tweets containing certain relevant keywords. While these datasets are still valuable for academic research, we caution against using them in the field to detect and particularly to take enforcement action against different types of abusive language. If they are used in this way we expect that they will systematically penalize African-Americans more than whites, resulting in racial discrimination. We have not evaluated these datasets for bias related to other ethnic and racial groups, nor other protected categories like gender and sexuality, but expect that such bias is also likely to exist. We recommend that efforts to measure and mitigate bias should start by focusing on how bias enters into datasets as they are collected and labeled. In particular, future work should focus on the following three areas.
First, we expect that some biases emerge at the point of data collection. Some studies sampled tweets using small, ad hoc sets of keywords created by the authors (Waseem and Hovy, 2016; Waseem, 2016; Golbeck et al., 2017), an approach demonstrated to produce poor results (King et al., 2017). Others start with large crowdsourced dictionaries of keywords, which tend to include many irrelevant terms, resulting in high rates of false positives (Founta et al., 2018). In both cases, by using keywords to identify relevant tweets we are likely to get non-representative samples of training data that may over- or under-represent certain communities. In particular, we need to consider whether the linguistic markers we use to identify potentially abusive language may be associated with language used by members of protected categories. For example, although  started with thousands of terms from the Hatebase lexicon, AAE is over-represented in the dataset (Waseem et al., 2018) because some keywords associated with this speech community were used more frequently on Twitter than other keywords in the lexicon and were consequently over-sampled.
Second, we expect that the people who annotate data have their own biases. Since individual biases reflect societal prejudices, they aggregate into systematic biases in training data. The datasets considered here relied upon a range of different annotators, from the authors (Golbeck et al., 2017; Waseem and Hovy, 2016) and crowdworkers (Founta et al., 2018) to activists (Waseem, 2016). Even the classifier trained on expert-labeled data (Waseem, 2016) flags black-aligned tweets as sexist at almost twice the rate of white-aligned tweets. While we agree that there is value in working with domain-experts to annotate data, these results suggest that activists may be prone to similar biases as academics and crowdworkers. Further work is therefore necessary to better understand how to integrate expertise into the process and how training can be used to help to mitigate bias. We also need to consider how sociocultural context influences annotators' decisions. For example, 48% of the workers employed by Founta et al. (2018) were located in Venezuela but the authors did not consider whether this affected their results (or if the annotators understood English sufficiently for the task).
Third, we observed substantial variation in the rates of class membership across classifiers and datasets. In Experiment 1 the rate at which tweets were assigned to negative classes varied from 1% to 18%. Some of the low proportions may indicate a preponderance of false negatives due to a lack of training data, suggesting that these models may not be able to sufficiently generalize beyond the data they were trained on. The high proportions may signal too many false positives, which may be a result of the over-sampling of abusive language in labeled datasets. Founta et al. (2018) claim that, on average, between 0.1% and 3% of tweets are abusive, depending upon the category of abuse. Identifying such content is therefore a highly imbalanced classification problem. When labeling datasets and evaluating our models we must pay more attention to the baseline rates of usage of different types of abusive language and how they may vary across populations (Silva et al., 2016).
Finally, we need to more carefully consider how contextual factors interact with linguistic subtleties and our definitions of abuse. The "n-word" is a particularly useful illustration of this issue. It exhibits polysemy, as it can be extremely racist or quotidian, depending on the speaker, the context, and the spelling. While the history of the word and its usages is too complex to be summarized here (Neal, 2013), when used with the "-er" suffix it is generally considered to be a racist epithet, associated with white supremacy. Prior work has confirmed that the use of this variant online is generally considered to be hateful (Kwok and Wang, 2013), although this is not always the case, for example when a victim of abuse shares an insult they have received. However, the variant with the "-a" suffix is typically used innocuously by African-Americans (Kwok and Wang, 2013); indeed, our results indicate that it is used far more frequently in black-aligned tweets (although it is still used by many white people). Despite this distinction, some studies have considered this variant to be hateful (Silva et al., 2016; Alorainy et al., 2018). This approach results in high rates of false positive cases of hate speech, thus  included a class for offensive language which does not appear to be hateful and let annotators decide which class tweets belonged to based upon their interpretation of the context, many of whom labeled tweets containing the term as offensive. Waseem et al. (2018) criticized this decision, claiming that it is problematic to ever consider the word to be offensive due to its widespread use among AAE speakers. This critique appears to be reasonable in the sense that we should not penalize African-Americans for using the word, but it avoids grappling with how to act when the word is used by other speakers and in other contexts. What should be done if it is used by a white social media user in reference to a black user?
How should the context of their interaction and the nature of their relationship affect our decision?
A "one-size-fits-all", context-independent approach to defining and detecting abusive language is clearly inappropriate. Different communities have different speech norms, such that a model suitable for one community may discriminate against another. However, there is no consensus in the field on whether and how we can develop detection systems sensitive to different social and cultural contexts. In addition to our recommendations for improving training data, we emphasize the necessity of considering how context matters and how detection systems will have uneven effects across different communities.

Limitations
First, while the Blodgett et al. (2016) dataset is the best available source of tweets labeled as AAE, we do not have ground truth labels for the racial identities of the authors. By filtering on users who predominantly used one type of language we may also miss users who frequently code-switch between AAE and SAE. Second, although we roughly approximate this in Experiment 2, we cannot rule out the possibility that the results, rather than being evidence of bias, are a function of different distributions of negative classes in the corpora studied. It is possible that words associated with negative categories in our abusive language datasets are also used to predict race by Blodgett et al. (2016), potentially contributing to the observed disparities. To more thoroughly investigate this issue we therefore require ground truth labels for abuse and race. Third, the results may vary for different classifiers or feature sets. It is possible that more sophisticated modeling approaches could enable us to alleviate bias, although they could also exacerbate it. Fourth, we did not interpret the results of the classifiers to determine why they made particular predictions. Further work is needed to identify what features of AAE the classifiers are learning to associate with negative classes. Finally, this study has only focused on one dimension of racial bias. Further work is necessary to investigate the extent to which data and models are biased against people belonging to other protected categories.