Hate Speech Dataset from a White Supremacy Forum

Hate speech is commonly defined as any communication that disparages a target group of people based on some characteristic such as race, colour, ethnicity, gender, sexual orientation, nationality, religion, or other characteristic. Due to the massive rise of user-generated web content on social media, the amount of hate speech is also steadily increasing. Over the past years, interest in online hate speech detection and, particularly, the automation of this task has continuously grown, along with the societal impact of the phenomenon. This paper describes a hate speech dataset composed of thousands of sentences manually labelled as containing hate speech or not. The sentences have been extracted from Stormfront, a white supremacist forum. A custom annotation tool has been developed to carry out the manual labelling task which, among other things, allows the annotators to choose whether to read the context of a sentence before labelling it. The paper also provides a thoughtful qualitative and quantitative study of the resulting dataset and several baseline experiments with different classification models. The dataset is publicly available.


Introduction
The rapid growth of content in social networks such as Facebook, Twitter and blogs makes it impossible to monitor what is being said. The increase of cyberbullying and cyberterrorism, and the use of hate on the Internet, make the identification of hate on the web an essential ingredient for anti-bullying policies of social media, as Facebook's CEO Mark Zuckerberg recently acknowledged 1 . This paper releases a new dataset of hate speech to further investigate the problem.
Although there is no universal definition for hate speech, the most accepted definition is provided by Nockleby (2000): "any communication that disparages a target group of people based on some characteristic such as race, colour, ethnicity, gender, sexual orientation, nationality, religion, or other characteristic". Consider the following 2 : (1) "God bless them all, to hell with the blacks" This sentence clearly contains hate speech against a target group because of their skin colour. However, the identification of hate speech is often not so straightforward. Besides defining hate speech as verbal abuse directed at a group of people because of specific characteristics, other definitions of hate speech in previous studies take care to include the speaker's determination to inflict harm (Davidson et al., 2017).
In all, there seems to be a pattern shared by most of the literature consulted (Nockleby, 2000;Djuric et al., 2015;Gitari et al., 2015;Nobata et al., 2016;Silva et al., 2016;Davidson et al., 2017), which would define hate speech as a) a deliberate attack, b) directed towards a specific group of people, and c) motivated by actual or perceived aspects that form the group's identity.
This paper presents the first public dataset of hate speech annotated on Internet forum posts in English at sentence level. The dataset is publicly available on GitHub 3 . The source forum is Stormfront 4 , the largest online community of white nationalists, characterised by pseudo-rational discussions of race (Meddaugh and Kay, 2009), which include different degrees of offensiveness. Stormfront is known as the first hate website (Schafer, 2002).
The rest of the paper is structured as follows: Section 2 describes the related work and contextualises the work presented in the paper; Section 3 introduces the task of generating a manually labelled hate speech dataset; this includes the design of the annotation guidelines, the resulting criteria, the inter-annotator agreement and a quantitative description of the resulting dataset; next, Section 4 presents several baseline experiments with different classification models using the labelled data; finally, Section 5 provides a brief discussion about the difficulties and nuances of hate speech detection, and Section 6 summarises the conclusions and future work.

Related Work
Research on hate speech has increased in recent years. The conducted studies are diverse and work on different datasets; there is no official corpus for the task, so authors usually collect and label their own data. For this reason, there exist few publicly available resources for hate speech detection.
Hatebase 5 is an online repository of structured, multilingual, usage-based hate speech. Its vocabulary is classified into eight categories: archaic, class, disability, ethnicity, gender, nationality, religion, and sexual orientation. Some studies make use of Hatebase to build a classifier for hate speech (Davidson et al., 2017;Serra et al., 2017;Nobata et al., 2016).
However, Saleem et al. (2016) prove that keyword-based approaches succeed at identifying the topic but fail to distinguish hateful sentences from clean ones, as the same vocabulary is shared by the hateful and target community, although with different intentions.
Kaggle's Toxic Comment Classification Challenge dataset 6 consists of 150k Wikipedia comments annotated for toxic behaviour. Waseem and Hovy (2016) published a collection of 16k tweets classified into racist, sexist or neither. Sharma et al. (2018) collected a set of 9k tweets containing harmful speech and they manually annotated them based on their degree of hateful intent. They describe three different classes of hate speech. The definition on which this paper is based overlaps mostly with their Class I, described as speech a) that incites violent actions, b) directed at a particular group, and c) with the intention of conveying hurting sentiments.
Google and Jigsaw developed a tool called Perspective 7 that measures the "toxicity" of comments. The tool is published as an API and gives a toxicity score between 0 and 100 using a machine learning model. This model has been trained on thousands of comments manually labelled by a team of people 8 ; to our knowledge, the resulting dataset is not publicly available.
The detection of hate speech has been tackled in three main different ways. Some studies focus on subtypes of hate speech. This is the case of Warner and Hirschberg (2012), who focus on the identification of anti-Semitic posts versus any other form of hate speech. Also in this line, Kwok and Wang (2013) target anti-black hate speech. Badjatiya et al. (2017); Gambäck and Sikdar (2017) study the detection of racist and sexist tweets using deep learning.
Other proposals focus on the annotation of hate speech as opposed to texts containing derogatory or offensive language (Davidson et al., 2017;Zampieri, 2017, 2018;Watanabe et al., 2018). They build multi-class classifiers with the categories "hate", "offensive", and "clean".
Finally, some studies focus on the annotation of hate speech versus clean comments that do not contain hate speech (Nobata et al., 2016;Burnap and Williams, 2015;Djuric et al., 2015). Gitari et al. (2015) follow this approach but further classify the hateful comments into two categories: "weak" and "strong" hate. Del Vigna et al. (2017) conduct a similar study for Italian.
In all, experts conclude that annotation of hate speech is a difficult task, mainly because of the data annotation process. Waseem (2016) conducted a study on the influence of annotator knowledge of hate speech on classifiers for hate speech. Ross et al. (2016) also studied the reliability of hate speech annotations and acknowledge the importance of having detailed instructions for the annotation of hate speech available.
This paper aims to tackle the inherent subjectivity and difficulty of labelling hate speech by following strict guidelines. The approach presented in this paper follows (Nobata et al., 2016;Burnap and Williams, 2015;Djuric et al., 2015) (i.e., "hateful" versus "clean"). Furthermore, the annotation has been performed at sentence level as opposed to full-comment annotation, with the possibility to access the original complete post for each sentence. To our knowledge, this is the first work that releases a manually labelled hate speech dataset annotated at sentence level in English posts from a white supremacy forum.

Hate Speech Dataset
This paper presents the first dataset of textual hate speech annotated at sentence level. Sentence-level annotation makes it possible to work with the minimum unit containing hate speech and to reduce the noise introduced by other sentences that are clean.
A total of 10,568 sentences have been extracted from Stormfront and classified as conveying hate speech or not, and into two other auxiliary classes, as per the guidelines described in Section 3.2. In addition, the following information is also given for each sentence: a post identifier and the sentence's position in the post, a user identifier, and a sub-forum identifier 9 . This information makes it possible to rebuild the conversations these sentences belong to. Furthermore, the number of previous posts the annotator had to read before making a decision on the category of the sentence is also given.
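As an illustration of how the per-sentence metadata could be used to rebuild posts, the following is a minimal sketch. The record layout and all field names (post_id, position, num_contexts, and so on) are assumptions made for this example and may not match the names used in the released files.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class SentenceRecord(NamedTuple):
    # Hypothetical field names; the released dataset may use different ones.
    post_id: int
    position: int       # position of the sentence inside its post
    user_id: int
    subforum_id: int
    num_contexts: int   # previous posts read by the annotator before deciding
    label: str          # HATE, NOHATE, RELATION or SKIP
    text: str


def rebuild_posts(records: Iterable[SentenceRecord]) -> Dict[int, List[SentenceRecord]]:
    """Group sentences by post and order them by their position in the post."""
    posts = defaultdict(list)
    for rec in records:
        posts[rec.post_id].append(rec)
    return {pid: sorted(recs, key=lambda r: r.position) for pid, recs in posts.items()}


if __name__ == "__main__":
    toy = [
        SentenceRecord(7, 2, 1, 3, 0, "NOHATE", "Second sentence."),
        SentenceRecord(7, 1, 1, 3, 0, "NOHATE", "First sentence."),
    ]
    for rec in rebuild_posts(toy)[7]:
        print(rec.position, rec.text)
```

Grouping by post identifier and sorting by position is enough to recover each post; recovering whole threads would additionally require thread-level information, which is not assumed here.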

Data extraction and processing
The content was extracted from Stormfront using web-scraping techniques and was dumped into a database arranged by sub-forums and conversation threads (Figea et al., 2016). The extracted forum content was published between 2002 and 2017. The process of preparing the candidate content to be annotated was the following:
1. A subset of 22 sub-forums covering diverse topics and nationalities was randomly sampled to gather individual posts uniformly distributed among sub-forums and users.
2. The sampled posts were filtered using an automatic language detector 10 to discard non-English texts.
3. The resulting posts were segmented into sentences with ixa-pipes (Agerri et al., 2014).
4. The sentences were grouped forming batches of 500 complete posts (∼1,000 sentences per batch).
9 All the identifiers provided are fake placeholders that facilitate understanding relations between sentences, Stormfront users, etc., but do not point back to the original source.
10 https://github.com/shuyo/languagedetection/blob/wiki/ProjectHome.md
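Steps 2 to 4 of the preparation process can be sketched as follows. This is only an illustration: is_english and split_sentences are hypothetical stand-ins for the language detector and for ixa-pipes, whose actual interfaces are not reproduced here, and the random sampling of step 1 is assumed to have happened already.

```python
from typing import Callable, List, Sequence


def prepare_batches(
    posts: Sequence[str],
    is_english: Callable[[str], bool],
    split_sentences: Callable[[str], List[str]],
    posts_per_batch: int = 500,
) -> List[List[List[str]]]:
    """Filter, segment and batch posts.

    Returns a list of batches; each batch is a list of posts and each post
    is a list of sentences, so posts are never split across batches.
    """
    english_posts = [p for p in posts if is_english(p)]             # step 2
    segmented = [split_sentences(p) for p in english_posts]         # step 3
    return [
        segmented[i:i + posts_per_batch]                            # step 4
        for i in range(0, len(segmented), posts_per_batch)
    ]


if __name__ == "__main__":
    # Toy stand-ins, for demonstration only.
    toy_posts = ["Hello there. How are you?", "xx no es ingles", "Fine."]
    batches = prepare_batches(
        toy_posts,
        is_english=lambda p: not p.startswith("xx"),
        split_sentences=lambda p: p.split(". "),
        posts_per_batch=2,
    )
    print(batches)
```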
The manual annotation task was divided into batches to control the process. During the annotation of the first two batches, the annotation procedure and guidelines were progressively refined and adapted. In total, 10,568 sentences contained in 10 batches have been manually annotated.
A post-processing step was performed to filter excessively long or short sentences. The cleansing process removed sentences shorter than 3 words or longer than 50 words. In total, 652 sentences were discarded, which represent 6.17% of the original dataset. The resulting dataset (henceforth referred to as the "clean" dataset) is the one described in Section 3.4 and used for experimentation in Section 4. The release of the dataset will contain both the raw sentences without any post-processing and the clean version, both annotated.
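The length filter described above can be sketched in a few lines. Whitespace tokenisation is an assumption of this example; the paper does not state how words were counted.

```python
from typing import Iterable, List


def keep_sentence(sentence: str, min_words: int = 3, max_words: int = 50) -> bool:
    """Keep sentences with at least min_words and at most max_words words."""
    n_words = len(sentence.split())  # assumption: whitespace tokenisation
    return min_words <= n_words <= max_words


def clean_dataset(sentences: Iterable[str]) -> List[str]:
    """Drop sentences shorter than 3 words or longer than 50 words."""
    return [s for s in sentences if keep_sentence(s)]


if __name__ == "__main__":
    print(clean_dataset(["Too short", "This one is kept"]))
    # Share of discarded sentences reported in the text: 652 out of 10,568.
    print(round(100 * 652 / 10568, 2))
```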

Annotation guidelines
Schmidt and Wiegand (2017) acknowledge that the procedure for hate speech annotation is fairly vague in previous studies, which translates into low agreement scores. In this study, all the annotators together created and discussed the guidelines to ensure all participants had the same understanding of hate speech. The final guidelines consider 4 types of sentences:

HATE
Sentences in this category contain hate speech. Hate speech is a a) deliberate attack b) directed towards a specific group of people c) motivated by aspects of the group's identity.
The three premises must be true for a sentence to be categorized as HATE. Consider the following examples: (2) "Poor white kids being forced to treat apes and parasites as their equals." (3) "Islam is a false religion however unlike some other false religions it is crude and appeals to crude people such as arabs." In (2), the speaker uses "apes" and "parasites" to refer to children of dark skin and implies they are not equal to "white kids". That is, it is an attack to the group composed of children of dark skin based on an identifying characteristic, namely, their skin colour. Thus, all the premises are true and (2) is a valid example of HATE. Example (3) brands all people of Arab origin as crude. That is, it attacks the group composed of Arab people based on their origin. Thus, all the premises are true and (3) is a valid example of HATE.

NOHATE
This label is used to categorise sentences that do not convey hate speech per the established definition. Consider the following examples: (4) "Where can I find NS speeches and music, also historical, in mp3 format for free download on the net." (5) "I know of Chris Rock and subsequently have hated him for a long time." Example (4) mentions National Socialism ("NS"), but the user is just interested in documentation about it. Therefore, the sentence itself is not an attack, i.e., premise a) is not true, despite the sound assumption that the speaker forms part of a hating community. Thus, (4) is not a valid instance of HATE. Example (5) is directed towards an individual; thus, premise b) is false and the sentence is not a valid example of HATE, despite the sound assumption that the attack towards the individual is based on his skin colour. Finally, it must be emphasized that the presence of pejorative language in a sentence cannot systematically be considered sufficient evidence to confirm the existence of hate speech. The use of "fag" in the following sentence: (6) "Two black fag's holding hands." cannot be said to be a deliberate attack, taken without any more context, despite it likely being offensive. Therefore, it cannot be considered HATE.

RELATION
When (6) (repeated as (7.1)) is read in context: (7.1) "Two black fag's holding hands." (7.2) "That's Great!" (7.3) "That's 2 blacks won't be having kids." it clearly conveys hate speech. The author is celebrating that two people belonging to the black minority will not be having children, which is a deliberate attack on a group of people based on an identifying characteristic. Annotation at sentence level fails to discern that there exists hate speech in this example. The label RELATION is for specific cases such as this, where the sentences in a post do not contain hate speech on their own, but the combination of several sentences does. Consider another example: (8.1) "Probably the most disgusting thing I've seen in the last year." (8.2) "She looks like she has some African blood in her, or maybe it's just the makeup." (8.3) "This is just so wrong." Each sentence in isolation does not convey hate speech: in (8.1) and (8.3), a negative attitude is perceived, but it is unknown whether it is targeted towards a group of people; in (8.2), there is no hint of an attack, not even of a negative attitude. However, the three sentences together suggest that having "African blood" makes a situation (whatever "this" refers to) disgusting, which constitutes hate speech according to the definition proposed.
The label RELATION is given separately to all the sentences that need each other to be understood as hate speech. That is, consecutive sentences with this label convey hate speech but depend on each other to be correctly interpreted.

SKIP
Sentences that are not written in English or that do not contain enough information to be classified as HATE or NOHATE are given this label.

Annotation procedure
In order to develop the annotation guidelines, a draft was first written based on previous similar work. Three of the authors annotated a 1,144-sentence batch of the dataset following the draft, containing only the categories HATE, NOHATE and SKIP. Then, they discussed the annotations and modified the draft accordingly, which resulted in the guidelines presented in the previous section, including the RELATION category. Finally, a different batch of 1,018 sentences was annotated by the same three authors adhering to the new guidelines in order to calculate the inter-annotator agreement. Table 1 shows the agreements obtained in terms of the average percent agreement (avg %), average Cohen's kappa coefficient (Cohen, 1960) (avg k), and Fleiss' kappa coefficient (Fleiss, 1971) (fleiss). The number of annotated sentences (# sent) and the number of categories to label (# cat) are also given for each batch. The results are in line with similar works (Nobata et al., 2016;Warner and Hirschberg, 2012).
All the annotation work was carried out using a web-based tool developed by the authors for this purpose. The tool displays all the sentences belonging to the same post at the same time, giving the annotator a better understanding of the post author's intention. If the annotator deems the complete post insufficient to categorize a sentence, the tool can show, on demand, the previous posts to which the problematic post is replying, up to the first post in the thread and its title. This consumption of context is registered automatically by the tool for further treatment of the collected data. As stated by other studies, context appears to be of great importance when annotating hate speech (Watanabe et al., 2018). Schmidt and Wiegand (2017) acknowledge that whether a message contains hate speech or not can depend solely on the context, and thus encourage the inclusion of extralinguistic features for annotation of hate speech. Moreover, Sharma et al. (2018) claim that context is essential to understand the speaker's intention.
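For readers who want to reproduce this kind of agreement figure on their own annotations, the three measures can be computed as in the sketch below. It follows the standard textbook definitions of percent agreement, Cohen's kappa and Fleiss' kappa; it is not the authors' script, and the input layout (one list of labels per annotator, aligned by sentence) is an assumption.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, List, Sequence


def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators (undefined if chance agreement is 1)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(annotations: List[Sequence[str]]) -> float:
    """Fleiss' kappa; annotations holds one aligned label list per annotator."""
    n_raters = len(annotations)
    n_items = len(annotations[0])
    # For every item, how many annotators chose each category.
    counts = [Counter(ann[i] for ann in annotations) for i in range(n_items)]
    categories = {c for cnt in counts for c in cnt}
    p_cat = {c: sum(cnt[c] for cnt in counts) / (n_items * n_raters) for c in categories}
    p_item = [
        (sum(v * v for v in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_item) / n_items
    p_e = sum(p * p for p in p_cat.values())
    return (p_bar - p_e) / (1 - p_e)


def average_pairwise(metric: Callable, annotations: List[Sequence[str]]) -> float:
    """Average a two-annotator metric over all annotator pairs."""
    pairs = list(combinations(annotations, 2))
    return sum(metric(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    ann = [
        ["HATE", "NOHATE", "NOHATE", "SKIP"],
        ["HATE", "NOHATE", "HATE", "SKIP"],
        ["HATE", "NOHATE", "NOHATE", "SKIP"],
    ]
    print(average_pairwise(percent_agreement, ann))
    print(average_pairwise(cohen_kappa, ann))
    print(fleiss_kappa(ann))
```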

Dataset statistics
This section provides a quantitative description and statistical analysis of the clean dataset published. Table 2 shows the distribution of the sentences over categories. The dataset is unbalanced as there exist many more sentences not conveying hate speech than "hateful" ones.
Table 3 refers to the subset of sentences that have required reading additional context (i.e. previous comments to the one being annotated) for the human annotators to make an informed decision. The category HATE is the one that requires the most context, usually due to the use of slang unknown to the annotator or because the annotator needed to find out the actual target of an offensive mention.
The remainder of the section focuses only on the subset of the dataset composed of the categories HATE and NOHATE, which are the core of this work. Table 4 shows the size of said subset, along with the average sentence length for each class, their word counts and their vocabulary sizes.
Regarding the distribution of sentences over Stormfront accounts, the dataset is balanced as there is no account that contributes notably more than any other: the average percentage of sentences is of 0.50 ± 0.42 per account, the total amount of accounts in the dataset being 2,723. The sub-forums that contain more HATE belong to the category of news, discussion of views, politics, philosophy, as well as to specific countries (i.e., Ireland, Britain, and Canada). In contrast, the subforums that contain more NOHATE sentences are about education and homeschooling, gatherings, and youth issues.
In order to obtain a more qualitative insight into the dataset, a HATE score (HS) has been calculated based on the Pointwise Mutual Information (PMI) value for each word towards the categories HATE and NOHATE. PMI allows calculating the correlation of each word with respect to each category. The difference between the PMI value of a word w and the category HATE and the PMI of the same word w and the category NOHATE results in the HATE score of w, as shown in Formula 1.
$$HS(w) = PMI(w, HATE) - PMI(w, NOHATE) \qquad (1)$$
Intuitively, this score is a simple way of capturing whether the presence of a word in a HATE context occurs significantly more often than in a NOHATE context. Table 5 shows the 15 most and least hateful words: the more positive a HATE score, the more hateful a word, and vice versa. The results show that the most hateful words are derogatory or refer to targeted groups of hate speech. On the other hand, the least hateful words are neutral in this regard and belong to the semantic fields of Internet, or temporal expressions, among others. This shows that the vocabulary is discernible by category, which in turn suggests that the annotation and guidelines are sound.
Performing the same calculation with bi-grams yields expressions such as "gene pool", "race traitor", and "white guilt" for the most hateful category, which appear to be concepts related to race issues. The least hateful terms are expressions such as "white power", "white nationalism" and "pro white", which clearly state the right-wing extremist politics of the forum users.
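A minimal sketch of a PMI-based HATE score is given below, defined as the difference between a word's PMI with HATE and its PMI with NOHATE. Several details are assumptions of this example rather than facts from the paper: probabilities are estimated from token counts, words are lowercased and split on whitespace, and add-one smoothing is applied so that words unseen in one class do not produce a logarithm of zero.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def hate_scores(
    hate_sentences: Iterable[str],
    nohate_sentences: Iterable[str],
    smoothing: float = 1.0,
) -> Dict[str, float]:
    """HS(w) = PMI(w, HATE) - PMI(w, NOHATE), with additive smoothing."""
    hate_counts = Counter(w for s in hate_sentences for w in s.lower().split())
    nohate_counts = Counter(w for s in nohate_sentences for w in s.lower().split())
    vocab = set(hate_counts) | set(nohate_counts)

    # Smoothed token totals per class and overall.
    n_hate = sum(hate_counts.values()) + smoothing * len(vocab)
    n_nohate = sum(nohate_counts.values()) + smoothing * len(vocab)
    total = n_hate + n_nohate

    scores = {}
    for w in vocab:
        c_hate = hate_counts[w] + smoothing
        c_nohate = nohate_counts[w] + smoothing
        p_w = (c_hate + c_nohate) / total
        # PMI(w, c) = log( P(w, c) / (P(w) * P(c)) )
        pmi_hate = math.log((c_hate / total) / (p_w * (n_hate / total)))
        pmi_nohate = math.log((c_nohate / total) / (p_w * (n_nohate / total)))
        scores[w] = pmi_hate - pmi_nohate
    return scores


if __name__ == "__main__":
    # Neutral placeholder tokens stand in for real vocabulary.
    toy = hate_scores(["alpha beta", "alpha gamma"], ["beta delta", "delta gamma"])
    for word, score in sorted(toy.items(), key=lambda kv: -kv[1]):
        print(f"{word}\t{score:.3f}")
```

Because the P(w) term cancels in the difference, the score reduces to the log-ratio of the word's smoothed relative frequency in each class, which is why a word equally frequent in both classes scores zero.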
Finally, the dataset has been contrasted against the English vocabulary in Hatebase. 9.28% of HATE vocabulary overlaps with Hatebase, a higher percentage than for NOHATE vocabulary, of which 6.57% of the words can be found in Hatebase. In Table 6, the distribution of HATE vocabulary is shown over Hatebase's 8 categories. Although some percentages are not high, all 8 categories are present in the corpus. Most of the HATE words from the dataset belong to ethnicity, followed by gender. This is in agreement with Silva et al. (2016), who conducted a study to analyse the targets of hate in social networks and showed that hate based on race was the most common.
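The overlap figures above can be read as the share of a class vocabulary that also appears in a lexicon. The sketch below computes that share; treating the overlap as a percentage of distinct lowercased word types is an assumption of this example, and the lexicon argument is a plain word list standing in for the Hatebase dump.

```python
from typing import Iterable


def overlap_percentage(vocabulary: Iterable[str], lexicon: Iterable[str]) -> float:
    """Percentage of distinct vocabulary words that also occur in the lexicon."""
    vocab = {w.lower() for w in vocabulary}
    lex = {w.lower() for w in lexicon}
    if not vocab:
        return 0.0
    return 100.0 * len(vocab & lex) / len(vocab)


if __name__ == "__main__":
    # Placeholder tokens; a real run would use the class vocabulary and the lexicon.
    print(overlap_percentage(["a", "b", "c", "d"], ["A", "x"]))
```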

Experiments
In order to further inspect the resulting dataset and to check the validity of the annotations (i.e. whether the two annotated classes are separable based solely on the text of the labelled instances) a set of baseline experiments have been conducted. These experiments do not exploit any external resource such as lexicons, heuristics or rules. The experiments just use the provided dataset and well-known approaches from the literature to provide a baseline for further research and improvement in the future.

Experimental setting
The experiments are based on a balanced subset of labelled sentences. All the sentences labelled as HATE have been collected, and an equivalent number of NOHATE sentences have been randomly sampled, summing up to 2k labelled sentences. Of these, 80% has been used for training and the remaining 20% for testing. The evaluated algorithms are the following:
• Support Vector Machines (SVM) (Hearst et al., 1998) over Bag-of-Words vectors. Word-count-based vectors have been computed and fed into a Python Scikit-learn LinearSVM 11 classifier to separate HATE and NOHATE instances.
• Convolutional Neural Networks (CNN), as described in (Kim, 2014). The implementation is a simplified version using a single input channel of randomly initialized word embeddings 12 .
• Long Short-Term Memory (LSTM) networks. An LSTM layer of size 128 over word embeddings of size 300.
All the hyperparameters are left to the usual values reported in the literature (Greff et al., 2017). No hyperparameter tuning has been performed. A more comprehensive experimentation and research has been left for future work.
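The balanced sampling and 80/20 split described in the experimental setting can be sketched as follows. The fixed seed, the shuffling before the split and the absence of stratification are assumptions of this example; the paper does not specify them.

```python
import random
from typing import List, Sequence, Tuple

Labelled = Tuple[str, str]


def balanced_split(
    hate: Sequence[str],
    nohate: Sequence[str],
    train_fraction: float = 0.8,
    seed: int = 0,
) -> Tuple[List[Labelled], List[Labelled]]:
    """Keep every HATE sentence, sample as many NOHATE ones, then split.

    Assumes there are at least as many NOHATE sentences as HATE sentences.
    """
    rng = random.Random(seed)
    sampled_nohate = rng.sample(list(nohate), len(hate))
    data = [(s, "HATE") for s in hate] + [(s, "NOHATE") for s in sampled_nohate]
    rng.shuffle(data)
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]


if __name__ == "__main__":
    hate = [f"h{i}" for i in range(10)]
    nohate = [f"n{i}" for i in range(50)]
    train, test = balanced_split(hate, nohate)
    print(len(train), len(test))
```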

Results
The baseline experiments include a majority class baseline showing the balance between the two classes in the test set. The results are given in terms of accuracy for HATE and NOHATE individually, and the overall accuracy, calculated according to Equations 2, 3 and 4, where TP are the true positives and FP are the false positives.
We show the accuracy for both complementary classes instead of the precision-recall of a single class to highlight the performance of the classifiers for both classes individually. Table 7 shows the results of using only sentences that did not require additional context to be labelled, while Table 8 shows the results of including those sentences that required additional context. Not surprisingly, the results are lower when including sentences that required additional context. If a human annotator required additional information to make a decision, it is to be expected that an automatic classifier would not have enough information or would have a harder time making a correct prediction. The results also show that NOHATE sentences are more accurately classified than HATE sentences. Overall, the LSTM-based classifier obtains better results, but even the simple SVM using bag-of-words vectors is capable of discriminating the classes reasonably well.

Error Analysis
In order to get a deeper understanding of the performance of the trained classifiers, a manual inspection has been performed on a set of erroneously classified sentences. Two main types of errors have been identified:
Type I errors. Sentences manually annotated as HATE but classified as NOHATE by the system, usually due to a lack of context or to a lack of the necessary world knowledge to understand the meaning of the sentence: (11) "Indeed, now they just need to feed themselves, educate themselves, police themselves ad nauseum..." (12) "If you search around you can probably find "hoax of the 20th century" for free on the net." In (11), it is not clear without additional context who "themselves" are. It actually refers to people of African origin. In its original context, the author was implying that they are not able to feed, police or educate themselves. This would make the sentence an example of hate speech, but it could also be a harmless comment given the appropriate context. In (12), the lack of world knowledge about what the Holocaust is, or what naming it "hoax" implies (i.e., denying its existence), would make it difficult to understand the sentence as an act of hate speech.
Type II errors. Sentences manually labelled as NOHATE and automatically classified as HATE, usually due to the use of common offensive vocabulary with non-hateful intent: (13) "I dont like reporting people but the last thing I will do is tolerate some stupid pig who claims Hungarians are Tartars." (14) "More black-on-white crime: YouTube -Black Students Attack White Man For Eating Dinner With Black" In (13), the user accuses and insults a particular individual. Example (14) is providing information on a reported crime. Although vocabulary of targeted groups is used in both cases (i.e., "Hungarians", "Tartars", "black"), the sentences do not contain HATE.

Discussion
There are several aspects of the introduced dataset, and hate speech annotation in general, that deserve a special remark and discussion.
First, the source of the content used to obtain the resulting dataset is on its own a source of offensive language. Since Stormfront is a white supremacist forum, almost every single comment contains some sort of intrinsic racism and other hints of hate. However, not every expression that contains a racist cue can be directly taken as hate speech. This is a truly subjective debate related to topics such as free speech, tolerance and civics. That is one of the main reasons why this paper carefully describes the annotation criteria for what counts here as hate speech and what does not. In any case, despite the efforts to make the annotation guidelines as clear, rational and comprehensive as possible, the annotation process has been admittedly demanding and far from straightforward.
In fact, the annotation guidelines were crafted in several steps, first paying attention to what the literature points out about hate speech annotation. After a first round of manual labelling, inconsistencies among the human annotators were discussed and the guidelines and examples were adapted. From those debates we extract some conclusions and pose several open questions. The first annotation criterion (hate speech being a deliberate attack) still lacks robustness and a proper definition, becoming ambiguous and subject to different interpretations. A more precise definition of what an attack is and what it is not would be necessary: Can an objective fact that nevertheless undermines the honour of a group of people be considered an attack? Is the mere use of certain vocabulary (e.g. "nigger") automatically considered an attack? The second annotation criterion (hate speech being directed towards a specific group of people) was controversial among the human annotators as well. Sentences were found that attacked individuals and mentioned the targets' skin colour or religion, political trends, and so on. Some annotators interpreted these as indirect attacks towards the collectivity of people that share the mentioned characteristics.
Another relevant point is the fact that the annotation granularity is sentence level. Most, if not all, of the existing datasets label full comments. A comment might be part of a more elaborate discourse, and not every part may convey hate. It is arguable whether a comment containing a single hateful sentence can be considered "hateful" or not. The dataset released provides the full set of sentences per comment with their annotations, so each user can decide how to work with it.
In addition, and related to the last point, one of the labels included for the manual labelling is RELATION. This label is meant to be used when two or more sentences need each other to be understood as hate speech, usually because one is a premise and the following is the (hateful) conclusion. This label has been seldom used.
Finally, a very important issue to consider is the need for additional context to label a sentence (i.e., the rest of the conversation or the title of the forum thread). This need affects human annotators and, of course, automatic classifiers, as confirmed in the error analysis (Section 4.3). When studying how much context the labelling depends on, it has been observed that annotators learn to distinguish hate speech more easily over time, requiring less and less context to make the annotations (see Figure 1).

Conclusions and Future Work
This paper describes a manually labelled hate speech dataset obtained from Stormfront, a white supremacist online forum.
The resulting dataset consists of ∼10k sentences labelled as conveying hate speech or not. Since the definition of hate speech has many subtleties, this work includes a detailed explanation of the manual annotation criteria and guidelines. Furthermore, several aspects of the resulting dataset have been studied, such as the necessity of additional context by the annotators to make a decision, or the distribution of the vocabulary used in the examples labelled as hate speech. In addition, several baseline experiments have been conducted using automatic classifiers, with a focus on examples that are difficult for automatic classifiers, such as those that required additional context or world knowledge. The resulting dataset is publicly available.
This dataset provides a good starting point for discussion and further research. As future work, it would be interesting to study how to include world knowledge and/or the context of an online conversation (i.e. previous and following messages, forum thread title, and so on) in order to obtain more robust automatic hate speech classifiers. Future studies could also explore how sentences labelled as RELATION affect classification, as these sentences have not been included in the experiments presented. In addition, more studies should be performed to characterize the content of the dataset in depth, regarding timelines, user behaviour and hate speech targets, for instance. Finally, since the proportion of HATE/NOHATE examples tends to be unbalanced, a more sophisticated manual labelling system with active learning paradigms would greatly benefit future labelling efforts.

Acknowledgements
This work has been supported by the European Commission under the project ASGARD (700381, H2020-FCT-2015). We thank the Hatebase team, in particular Hatebase developer Timothy Quinn, for providing Hatebase's English vocabulary dump to conduct this study. Finally, we would like to thank the reviewers of the paper for their thorough work and valuable suggestions.