Introducing CAD: the Contextual Abuse Dataset

Online abuse can inflict harm on users and communities, making online spaces unsafe and toxic. Progress in automatically detecting and classifying abusive content is often held back by the lack of high-quality, detailed datasets. We introduce a new dataset of primarily English Reddit entries which addresses several limitations of prior work. It (1) contains six conceptually distinct primary categories as well as secondary categories, (2) has labels annotated in the context of the conversation thread, (3) contains rationales and (4) uses an expert-driven group-adjudication process for high-quality annotations. We report several baseline models to benchmark the work of future researchers. The annotated dataset, annotation guidelines, models and code are freely available.


Introduction
Social media platforms have enabled unprecedented connectivity, communication and interaction for their users. However, they often harbour harmful content such as abuse and hate, inflicting myriad harms on online users (Waseem and Hovy, 2016; Schmidt and Wiegand, 2017a; Fortuna and Nunes, 2018; Vidgen et al., 2019c). Automated techniques for detecting and classifying such content increasingly play an important role in moderating online spaces.
Detecting and classifying online abuse is a complex and nuanced task which, despite many advances in the power and availability of computational tools, has proven remarkably difficult (Vidgen et al., 2019a; Wiegand et al., 2019; Schmidt and Wiegand, 2017b; Waseem et al., 2017). As argued in a recent review, research has 'struggled to move beyond the most obvious tasks in abuse detection.' One of the biggest barriers to creating higher-performing, more robust, nuanced and generalisable classification systems is the lack of clearly annotated, large and detailed training datasets. However, creating such datasets is time-consuming, complicated and expensive, and requires a mix of both social and computational expertise.
We present a new annotated dataset of ∼25,000 Reddit entries. It contains four innovations that address limitations of previous labelled abuse datasets. First, we present a taxonomy with six conceptually distinct primary categories (Identity-directed, Person-directed, Affiliation-directed, Counter Speech, Non-hateful Slurs and Neutral). We also provide salient subcategories, such as whether personal abuse is directed at a person in the conversation thread or at someone outside it. This taxonomy offers greater coverage and granularity of abuse than previous work. Each entry can be assigned to multiple primary and/or secondary categories (Section 3).
Second, we annotate content in context, by which we mean that each entry is annotated in the context of the conversational thread it is part of. Every annotation has a label for whether contextual information was needed to make the annotation. To our knowledge, this is the first work on online abuse to incorporate a deep level of context. Third, annotators provided rationales. For each entry they highlighted the part of the text which contains the abuse (and the relevant parts for Counter Speech and Non-hateful Slurs). Fourth, we provide high quality annotations by using a team of trained annotators and a time-intensive discussion-based process, facilitated by experts, for adjudicating disagreements (Section 4).
This work addresses the need for granular and nuanced abusive content datasets, advancing efforts to create accurate, robust, and generalisable classification systems. We report several baseline models to benchmark the work of future researchers (Section 5). The annotated dataset, annotation codebook and code have been made available. 1 A full description of the dataset is given in our data statement in the Appendix (Bender and Friedman, 2018).

Background
Taxonomies of abuse Taxonomies vary in terms of the scope of abusive behaviours they cover. Some offer categories for abuse against both individuals and groups (Zampieri et al., 2020), others cover only abuse against identities (Fortuna and Nunes, 2018; Kiela et al., 2020), against only a single identity, such as misogyny (Anzovino et al., 2018) or Islamophobia (Vidgen and Yasseri, 2019), or only abuse against individuals (Wulczyn et al., 2017). Some research distinguishes between content in different languages or taken from different platforms (Kumar et al., 2018). Waseem et al. (2017) outline two dimensions for characterising online abuse: first, whether it is directed against individuals or groups; second, whether it is implicit or explicit (also referred to as 'covert' or 'overt' (Kumar et al., 2018) and 'weak' or 'strong' (Vidgen and Yasseri, 2019)). These two dimensions (target and strength) have been further developed in other studies. Zampieri et al. (2019) use a hierarchical three-level approach to annotation, separating (a) offensive from not-offensive tweets, (b) offensive into targeted and untargeted statements and (c) for targeted statements, identification of what is attacked (group, individual or other). Vidgen et al. (2019a) propose a tripartite distinction, also separating 'concept-directed' abuse from group-directed and person-directed abuse. However, this is problematic as concept-directed content may be better understood as legitimate critique.
Some taxonomies explicitly separate abuse from closely-related but non-abusive forms of online expression. This reflects social scientific insights which emphasize the importance, but also the difficulty, of making such distinctions (Rossini, 2019, 2020). Some work distinguishes hostility against East Asia from criticism of East Asia, as well as from counter speech and discussion of prejudice. Procter et al. (2019) distinguish cyber hate from counter speech, as do Qian et al. (2019) and Mathew et al. (2019), amongst others.

Annotation and Data
The quality of annotations for abusive datasets has been widely critiqued, and inter-rater agreement scores are often remarkably low. Wulczyn et al. (2017) report an Alpha of 0.45, Sanguinetti et al. (2018) report Kappas from k=0.37 for offence to k=0.54 for hate, Gomez et al. (2020) report a Kappa of 0.15 in the "MMH150" dataset of hateful memes, and Fortuna and Nunes (2018) report a Kappa of 0.17 for a text-only task. One classification study of prejudice against East Asia finds that 27% of classification errors are due to annotation mistakes. Low agreement is partly because abuse is inherently ambiguous and subjective, and individuals can perceive the same content very differently (Salminen et al., 2018, 2019).
Many abusive content datasets use crowdsourced annotations (Zampieri et al., 2019; Fortuna and Nunes, 2018). These are cheap and scalable but can be low quality and are often ill-suited to complicated tasks (Sabou et al., 2014). Trained experts with clear guidelines are often preferable for ensuring consistency (Vidgen and Derczynski, 2020). Whether expert or crowdsourced annotators are used, a diverse pool is needed, as annotators encode their biases, backgrounds and assumptions into their annotations (Sap et al., 2019; Waseem et al., 2017).
Most datasets use a simple majority vote over annotations to determine the final labels. However, majority agreement does not guarantee that content is correctly labelled, especially for complex edge-cases. One option is to use a method that adjusts annotators' impact based on their quality, such as MACE (Hovy et al., 2013). However, this may not work well on the most ambiguous content. Group decision-making processes present a promising way of improving annotation quality. Breitfeller et al. (2019) use a collaborative multi-stage process to label micro-aggressions and Card et al. (2015) use a similar process for labelling news articles. This ensures more oversight from experts and reflection by annotators on the difficult content. It also provides a feedback loop for annotators to learn from mistakes and improve.
A well-established problem with abusive content datasets is that each item of content is marked up individually, without taking into account any content that came before (Gao and Huang, 2017; Mubarak et al., 2017). This can lead to poor quality annotations when content is ambiguous or unclear without knowing the context. Detection systems which do not account for context are likely to be less applicable in the real world, where nearly all content appears within a wider context (Seaver, 2015). Pavlopoulos et al. (2020) systematically investigate the role of context in a dataset of Wikipedia comments by providing annotators with the 'parent' before showing them the 'child' entry. In one experiment, at least 5% of the data was affected by context. In a study of Twitter conversations, Procter et al. (2019) label replies to tweets based on whether they 'agree' or 'disagree' with the original message. Notwithstanding these studies, further work is needed to better understand the role of context and how abuse emerges within threads, as well as the challenges of detecting deeply contextual content.

Taxonomy
We present a hierarchical taxonomy of abusive content, which comprises six primary categories and additional secondary categories. It builds on critical social scientific research (Marwick and Miller, 2014; Citron and Norton, 2011; Lenhart et al., 2016), and addresses issues in previous taxonomies, including those provided by Zampieri et al. (2020), Waseem et al. (2017), Founta et al. (2018) and Vidgen et al. (2019a). It offers greater coverage by including three conceptually distinct types of abusive content (Identity-directed abuse, Affiliation-directed abuse and Person-directed abuse) as well as three types of non-abusive content (Neutral, Counter Speech and Non-hateful Slurs). The taxonomic structure is shown in Figure 1. Indicative examples are given in Table 1.

Identity-directed abuse
Content which contains a negative statement made against an identity. An 'identity' is a social category that relates to a fundamental aspect of individuals' community, socio-demographics, position or self-representation (Jetten et al., 2004). It includes but is not limited to Religion, Race, Ethnicity, Gender, Sexuality, Nationality, Disability/Ableness and Class. The secondary category comprises five subtypes of identity-directed abuse: Derogation, Animosity, Threatening language, Glorification and Dehumanization.
Derogation Language which explicitly attacks, demonizes, demeans or insults a group. Derogation includes representing or describing a group in extremely negative terms and expressing negative emotions about them. Derogation is the basis of most 'explicit' forms of abuse in existing hateful content taxonomies, although it is often referred to under different labels.
Animosity Language which expresses abuse against a group in an implicit or subtle manner. The lynchpin of this category is that negativity is directed at the group (i.e., there must be some aspect which is discernibly abusive or demeaning about the group in question) but this is not expressed explicitly. Animosity includes undermining the experiences and treatment of groups, ridiculing them, and accusing them of receiving 'special treatment'. Animosity is similar to the 'implicit' category used in other taxonomies (Waseem et al., 2017; Vidgen and Yasseri, 2019; Kumar et al., 2018).
Threatening language Language which either expresses an intent/desire to inflict harm on a group, or expresses support for, encourages or incites such harm. Harm includes physical violence, emotional abuse, social exclusion and harassment. This is one of the most harmful forms of hateful language (Marwick and Miller, 2014;Citron and Norton, 2011) yet usually it is part of an 'explicit' hate category (Zampieri et al., 2019;Wulczyn et al., 2017;Waseem and Hovy, 2016) and few datasets have treated it as a separate category, see Golbeck et al. (2017), Anzovino et al. (2018), and Hammer (2014) for exceptions.
Dehumanization Language which maliciously describes groups as insects, animals or non-humans (e.g., leeches, cockroaches, germs, rats) or makes explicit dehumanizing comparisons. Dehumanization has been linked with real-world violence and is a particularly important focus for computational work (Leader Maynard and Benesch, 2016; Matsuda et al., 1993), yet it is often combined into a broader 'explicit' category (Palmer et al., 2020; Kiela et al., 2020) and has been insufficiently studied on its own, apart from Mendelsohn et al. (2020).
Glorification of hateful entities Language which explicitly glorifies, justifies or supports hateful actions, events, organizations, tropes and individuals (which, collectively, we call 'entities'). It includes denying that identity-based atrocities took place (e.g., Genocide). Glorification is one of the least studied forms of hate computationally, likely because it is more ambiguous, particularly when individuals only express interest in the entities (de Gibert et al., 2018).

Affiliation-directed abuse
Content which expresses negativity against an affiliation. We define 'affiliation' as a (more or less) voluntary association with a collective. Affiliations include but are not limited to: memberships (e.g., trade unions), party memberships (e.g., Republicans), political affiliations (e.g., right-wing people) and occupations (e.g., doctors). The same secondary categories used for Identity-directed abuse apply to Affiliation-directed abuse. In some previous taxonomies, affiliations have been mixed in with identities (Founta et al., 2018; Zampieri et al., 2019), although in general they have been excluded as out of scope (e.g., Waseem and Hovy (2016)).

Person-directed abuse
Content which directs negativity against an identifiable person, who is either part of the conversation thread or is named. Person-directed abuse includes serious character-based attacks, such as accusing the person of lying, as well as aggression, insults and menacing language. Person- and Identity-directed forms of abuse are often addressed in separate taxonomies, although in some studies they have been merged into a more general 'toxic' category (Wulczyn et al., 2017; Golbeck et al., 2017). Recent work has addressed both types of content, recognising that they are conceptually different but often co-occur in the real world and share syntactic and lexical similarities (Zampieri et al., 2019; Mandl et al., 2019). We provide two secondary categories for Person-directed abuse: abuse at a person who is part of the conversation thread and abuse about a person who is not part of the conversation thread. The person must be clearly identified, either by their actual name, username or status (e.g., 'the president of America'). To our knowledge, this distinction has not been used previously.

Counter Speech
Content which challenges, condemns or calls out the abusive language of others. Counter Speech can take several forms, including directly attacking/condemning abusive language in unambiguous terms, challenging the original content and 'calling out' the speaker for being abusive. We use a similar approach to Qian et al. (2019) and Mathew et al. (2019) who also treat counter speech as a relational act that responds to, and challenges, actual abuse.

Non-hateful Slurs
A slur is a collective noun, or a term closely derived from a collective noun, which is pejorative. Slurs include terms which are explicitly insulting (e.g. 'n*gga' or 'kebabi') as well as terms which implicitly express animosity against a group (e.g. 'Rainy' or 'Chad'). A slur by itself does not indicate identity-directed abuse because in many cases slurs are not used in a derogatory way but, rather, to comment on, counter or undermine genuine prejudice (Jeshion, 2013), or they have been reclaimed by the targeted group, such as use of 'n*gga' by black communities (Davidson and Weber, 2019). In this category we mark up only the non-hateful use of slurs. Hateful uses of slurs fall under Identity-directed abuse.

Neutral
Content which does not contain any abuse, Non-hateful Slurs or Counter Speech, and as such does not fall into any of the other categories.

Data collection
The low prevalence of online abuse in 'the wild' (likely as little as 0.1% in English language social media (Vidgen et al., 2019b)) means that most training datasets have used some form of purposive (or 'directed') sampling to ensure enough entries are in the positive class (Fortuna et al., 2020). However, this can lead to biases in the dataset (Ousidhoum et al., 2020) which, in turn, may impact the performance, robustness and fairness of detection systems trained on them (Sap et al., 2019). Notably, the widely-used practice of keyword sampling can introduce topic and author biases, particularly for datasets with a high proportion of implicit abuse (Wiegand et al., 2019). Accordingly, like Qian et al. (2019), we use community-based sampling, selecting subreddits which are likely to contain higher-than-average levels of abuse and a diverse range of abuse. This should lead to a more realistic dataset where the abusive and non-abusive content share similarities in terms of topic, grammar and style. We identified 117 subreddits likely to contain abusive content, which we filtered to just 16, removing subreddits which (1) had a clear political ideology, (2) directed abuse against just one group, or (3) did not have recent activity. 187,806 conversation threads were collected over the 6 months from 1st February 2019 to 31st July 2019, using the PushShift API (Gaffney and Matias, 2018). We then used stratified sampling to reduce this to 1,394 posts and 23,762 comments (25,156 in total) for annotation. See the Data Statement in the Appendix for more information on how the initial 117 subreddits were identified.

Annotation
All posts and comments were annotated. The titles and main bodies of posts were treated separately, resulting in 1,394 post titles, 1,394 post bodies and 23,762 comments being annotated (26,550 entries in total). All entries were assigned to at least one of the six primary categories. Entries could be assigned to several primary categories and/or several secondary categories. The dataset contains 27,494 distinct labels.
All entries were first independently annotated by two annotators. Annotators underwent 4 weeks of training and were either native or fluent English speakers. See the Data Statement in the Appendix for more information. Annotators worked through entire Reddit conversations, making annotations for each entry with full knowledge of the previous content in the thread. All disagreements were surfaced for adjudication. We used a consensus-based approach in which every disagreement was discussed by the annotators, facilitated by an expert with reference to the annotation codebook. This is a time-consuming process which helps to improve annotators' understanding and to identify areas where the guidelines needed to be clarified and improved. Once all entries were annotated through group consensus, they were reviewed in one go by the expert to ensure consistency in how labels were applied. This helped to address any issues that emerged as annotators' experience and the codebook evolved throughout the annotation process.
In some cases the labels may appear counter-intuitive. For instance, one entry starts "ITT: Bernie Sanders is imperfect and therefore is a garbage human being." This might appear to be an insult; however, the remainder of the statement shows that it is intended ironically. Similarly, use of "orange man bad" may appear to be an attack against Donald Trump when, in reality, it supports Trump by mocking left-wing people who oppose him. Nuances such as these only become apparent after multiple reviews of the dataset and through group-based discussions.
Targets of abuse For Identity-directed, Affiliation-directed and Non-hateful Slurs, annotators inductively identified targets. Initially, 1,500 targets were identified (including spelling variations), which were reduced to 185 through review and cleaning. All important distinctions, including intersectional identities and specific subgroups and outlooks (e.g., 'non-gender dysphoric transgender people'), were retained. The identities were then grouped into 8 top-level categories. The top-level categories for Identity-directed abuse include Gender, Ableness/disability and Race.
Context For every annotation, a 'context' flag was given to capture how the annotation was made. If the primary/secondary label was based on just the entry by itself, then 'Current' was selected. If knowledge of the previous content in the conversation thread was required, then 'Previous' was selected. Context was primarily relevant in two ways: first, for understanding who a generic pronoun referred to (e.g., 'they'); second, for expressions of support for another user's abuse (e.g., person 1 writes 'I want to shoot some X' and person 2 responds 'Go do it!'). If this context is not taken into account then the abuse would be missed. In some cases, only the context of a single previous statement was needed to understand an entry (as with the example just given), whereas in other cases several previous statements were required. For Neutral, no label is given for context. For Non-hateful Slurs, only 'Current' could be selected. Our definition of Counter Speech is relational, so all Counter Speech requires 'Previous' context. For Affiliation-, Identity- and Person-directed abuse, approximately 25-32% of content was labelled with 'Previous' context.
Rationales For all categories other than Neutral, annotators highlighted the part of the entry related to the category. This is important for Reddit data where some comments are very long; the longest entry in our dataset has over 10k characters. As part of the adjudication process, just one rationale was selected for each entry, giving a single 'gold standard'.
Inter-annotator agreement Inter-annotator agreement for the primary categories was measured using Fleiss' Kappa. It was 'moderate' overall (0.583) (McHugh, 2012). This compares favourably with other abusive content datasets (Gomez et al., 2020; Fortuna and Nunes, 2018; Wulczyn et al., 2017), especially given that our taxonomy contains six primary categories. Agreement was highest for Non-hateful Slurs (0.754). It was consistently 'moderate' for Neutral (0.579), Person (0.513), Affiliation (0.453) and Identity (0.419), but was lower for Counter Speech (0.267). This reflects Counter Speech's low prevalence (meaning annotators were less experienced at identifying it) and the subjective nature of judging whether content counters abuse or is implicitly supportive. One challenge is that if annotators missed a category early on in a thread then they would also miss all subsequent context-dependent entries.
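For reference, Fleiss' Kappa can be computed from a matrix of per-entry category counts. The following is a minimal pure-Python sketch for illustration, not the implementation used in this work:

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa for a ratings matrix.

    counts[i][j] is the number of annotators who assigned item i to
    category j; every item must be rated by the same number of annotators.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])

    # observed agreement: mean over items of the proportion of
    # agreeing rater pairs out of all rater pairs
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # chance agreement from the marginal category proportions
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(len(counts[0]))
    )
    return (p_bar - p_e) / (1 - p_e)
```

Perfect agreement yields a Kappa of 1.0; systematic disagreement pushes it below zero.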

Prevalence of categories
The prevalence of the primary and secondary categories in the dataset is shown in Table 3. Neutral content makes up the large majority of the data, followed by Identity-directed abuse, which accounts for 9.9%, Affiliation-directed abuse (5.0%), Person-directed abuse (4.0%), Counter Speech (0.8%) and Non-hateful use of slurs (0.5%). Animosity and Derogation are the most frequent secondary categories in Identity-directed and Affiliation-directed abuse, with Threatening language, Dehumanization and Glorification accounting for less than 5% combined. This is unsurprising given the severity of such language. Other training datasets for online abuse generally report similar or slightly higher levels of non-neutral content (e.g., Gomez et al. (2020), where 82% is neutral, and Waseem and Hovy (2016)).

Experimental setup
Data splits For our classification experiments, we exclude entries that are "[removed]", "[deleted]" or empty because they were either a blank entry associated with a post title or an entry that only contained an image. We also exclude entries written by two prolific bots (SnapshillBot and AutoModerator) and non-English entries, which were identified with langid.py (Lui and Baldwin, 2012) and then manually verified. Entries with an image were included, but the image was not used for classification. Hyperparameters are tuned on the development set.
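The exclusion rules above amount to a simple filter over entries. A sketch under assumed field names (the keys 'text' and 'author' are illustrative, not the dataset's actual schema):

```python
# Prolific bot accounts excluded from the experiments
BOT_AUTHORS = {"SnapshillBot", "AutoModerator"}

def keep_entry(entry):
    """True if an entry should enter the experimental splits."""
    text = (entry.get("text") or "").strip()
    if text in {"", "[removed]", "[deleted]"}:
        return False  # blank, removed or deleted content
    if entry.get("author") in BOT_AUTHORS:
        return False  # bot-authored content
    # language filtering (langid.py plus manual verification)
    # would be applied as a separate step
    return True
```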
Classification task We automatically classify the primary categories. Due to the low prevalence of Non-hateful Slurs, these are not used as a separate category in the classification experiments. Instead, we re-assign entries with only a Non-hateful Slur label to Neutral. For entries that have a Non-hateful Slur label and at least one other label, we simply ignore the Non-hateful Slur label. 2 In the training set, 1.94% of entries have more than one primary category. When we exclude Neutral entries (because these entries cannot have another category), this increases to 10.5%. The training data has a label cardinality of 1.02 (Tsoumakas and Katakis, 2007). We thus formulate the task as a multilabel classification problem. It is challenging given the highly skewed label distributions, the influence of context, and the multilabel setup.
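The label re-assignment and the label cardinality statistic described above can be sketched as follows (the category names are shorthand for the dataset's labels):

```python
def collapse_slur_label(labels):
    """Re-assignment used for the experiments: entries with only a
    Non-hateful Slur label become Neutral; when the Slur label
    co-occurs with other labels, it is simply dropped."""
    if labels == {"Slur"}:
        return {"Neutral"}
    return labels - {"Slur"}

def label_cardinality(label_sets):
    """Average number of labels per entry (Tsoumakas and Katakis, 2007)."""
    return sum(len(s) for s in label_sets) / len(label_sets)
```

A cardinality close to 1 (here, 1.02) indicates that most entries carry a single label even though the setup is multilabel.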

Methods
We compare several popular baseline models. We only use the text of entries as input. The context of entries (e.g., previous entries in a thread) is not taken into account; integrating context could be explored in future work.
Logistic Regression (LR) We use Logistic Regression with L2 regularization, implemented using scikit-learn (Pedregosa et al., 2011). There are different approaches to multilabel classification (Boutell et al., 2004; Tsoumakas and Katakis, 2007). One common approach is the Label Powerset method, where a new label is created for each unique label combination. However, this approach is not suitable for our data: many label combinations have only a few instances, and classifiers would not be able to recognise unseen label combinations. We therefore use a binary relevance setup, where a binary classifier is trained for each label separately. Because the class distribution is heavily skewed, classes are weighted inversely proportional to their frequencies in the training data.
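A minimal sketch of this binary relevance setup in scikit-learn, using `MultiOutputClassifier` to train one frequency-weighted binary classifier per label. The toy features and multi-hot labels below are illustrative; the actual text feature extraction is not shown:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# toy features and multi-hot labels for three hypothetical categories
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]] * 5, dtype=float)
Y = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 1], [1, 0, 0]] * 5)

# binary relevance: one independent L2-regularized binary classifier
# per label, with classes weighted inversely to their frequency
model = MultiOutputClassifier(
    LogisticRegression(penalty="l2", class_weight="balanced", max_iter=1000)
)
model.fit(X, Y)
pred = model.predict(X)  # shape: (n_entries, n_labels)
```

Unlike Label Powerset, this setup can emit label combinations never seen in training, at the cost of ignoring dependencies between labels.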

BERT and DistilBERT
We finetune the BERT base uncased model (Devlin et al., 2019) with commonly used hyperparameters (see the Appendix). Given BERT's sensitivity to random seeds (Dodge et al., 2020), each setting was run with five different random seeds. Our implementation uses the Hugging Face Transformers library. We use a binary cross entropy loss and encode the labels as multi-hot vectors. Classes are weighted by their ratio of negative over positive examples in the training data. We also finetune DistilBERT, a lighter version of BERT trained with knowledge distillation.

Results
Evaluation metrics The precision, recall and F1 score for each primary category are reported in Table 4. In Table 5, we report micro and macro average F1 scores. Because of the highly skewed class distribution, we favor macro F1 scores. We also report the exact match accuracy (the fraction of entries for which the full set of labels matches).
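These metrics can be computed directly from gold and predicted label sets. A small pure-Python sketch for illustration, not the evaluation code used here:

```python
def _f1(tp, fp, fn):
    """F1 from true positive, false positive and false negative counts."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def evaluate(gold, pred, labels):
    """Macro F1 and exact-match accuracy over lists of label sets."""
    macro = 0.0
    for lab in labels:
        tp = sum(lab in g and lab in p for g, p in zip(gold, pred))
        fp = sum(lab not in g and lab in p for g, p in zip(gold, pred))
        fn = sum(lab in g and lab not in p for g, p in zip(gold, pred))
        macro += _f1(tp, fp, fn)
    macro /= len(labels)
    # exact match: the full predicted label set equals the gold set
    exact = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return macro, exact
```

Macro averaging gives each category equal weight, which is why it is preferred here over micro averaging given the skew towards Neutral.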
Classifier comparison BERT performs best and achieves a substantial performance improvement over Logistic Regression (Macro F1 of 0.455 vs. 0.343). The performance of DistilBERT is slightly lower, but very close to BERT's performance. With both BERT and DistilBERT there is still much room for improvement on most categories. Note that a majority class classifier which labels everything as Neutral would achieve a high accuracy (0.818) but a low F1 macro score (0.180). There were no clear performance differences between entries from subreddits that were or were not included in the training data.
Primary categories Performance differs substantially between the different categories (Table 4). All classifiers attain high F1 scores on Neutral entries (LR: 0.859, BERT: 0.902); this is expected as the class distribution is highly skewed towards Neutral. Performance is lowest on Counter Speech (LR: 0.042, BERT: 0.091), possibly due to a combination of factors. First, this category has the lowest number of training instances. Second, inter-annotator agreement was lowest on Counter Speech. And third, all Counter Speech annotations are based on previous content in the thread.
Error analysis Qualitative analysis shows that the BERT model often misclassifies neutral content which mentions identities (e.g., non-misogynistic discussions of women) or contains profanities and aggressive language. It tends to classify Affiliation- and Identity-directed abuse which uses less aggressive language and contains fewer abusive keywords as Neutral. Surprisingly, many of the Person-directed entries which are misclassified as Neutral contain clear signals of abuse, such as profanities and overt aggression. No discernible pattern was observed for Counter Speech misclassified as a different category; for this category, the low performance may be attributed mostly to its low frequency in the training data.
Context Our benchmark models do not explicitly take into account context for prediction. As expected, all our models are worse at predicting the primary categories of entries where context was required for the annotation. For Person-directed abuse, the difference in recall between its two secondary categories is small (e.g., BERT: 48.6% vs. 49.0%). Furthermore, for Identity-directed abuse the recall for Animosity (LR: 36.2%, BERT: 45.3%) tends to be lower than the recall for Derogation (LR: 49.0%, BERT: 65.9%), which is expected as Animosity expresses abuse in an implicit manner and is often more nuanced. The larger difference for BERT vs. Logistic Regression shows the promise of more advanced models in distinguishing subcategories. For Affiliation-directed abuse, the differences are smaller; here, the recall for Animosity is (unexpectedly) slightly higher (LR: 43.3%, BERT: 49.5%) than for Derogation (LR: 36.1%, BERT: 48.0%).
Label dependence The multilabel setup of this classification task makes it a challenging problem. All models tend to assign too many labels. For example, DistilBERT predicts too few labels in only 1.17% of cases, otherwise predicting the right number (91.88%) or too many (6.96%). For BERT, the imbalance is even larger (1.06% too few; 9.21% too many).
Dependencies between labels are sometimes violated. In our taxonomy, entries which are Neutral cannot have another label, but our models violate this constraint in many cases: DistilBERT classifies 3.8% of entries as Neutral plus at least one other class, and the figure is higher still for BERT (5.4%) and Logistic Regression (10.7%). Future work could therefore explore modeling relationships between labels.
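Such violations are straightforward to audit post hoc. A sketch of the constraint check (the function and label name are illustrative):

```python
def neutral_violations(predicted_sets):
    """Count predictions that pair Neutral with another label,
    which the taxonomy forbids."""
    return sum(1 for s in predicted_sets if "Neutral" in s and len(s) > 1)
```

Checks like this could also be turned into a decoding-time constraint, e.g. dropping the Neutral label whenever another label is predicted.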

Discussion and Conclusion
We have presented a detailed dataset for training abusive content classification systems. It incorporates relevant social scientific concepts, providing a more nuanced and robust way of characterising, and therefore detecting, abuse. We have also presented benchmark experiments, which show much room for improvement.
Our analyses indicate numerous areas to explore further, including creating systems which explicitly model conversation threads to account for context. Predictive methods could be applied to understand and forecast when a conversation is turning toxic, potentially enabling real-time moderation interventions. More powerful models could also be applied to better distinguish the primary categories and to begin classifying the secondary categories. Classification could also draw on the images attached to entries, which we did not use. Finally, we would expect the rationales to be of considerable use in future experiments, both for classification and for understanding the annotation process.
The current work has several limitations. First, the class distribution is heavily skewed towards the Neutral class and some abusive categories have low frequencies. This better reflects real-world prevalence of abuse but can limit the signals available for classification. Second, inter-annotator agreement was in-line with other research in this domain but could still be improved further, especially with 'edge case' content.

Ethical considerations
We follow the ACM's Code of Ethics and Professional Conduct, as well as academic guidelines for ethically researching activity on social media (Townsend and Wallace, 2017; Williams, 2019). Online abuse poses a substantial risk of harm to online users and their communities, and there is a strong social justification for conducting this work.
Dataset collection We used the Pushshift API (https://pushshift.io/api-parameters/) to collect data from Reddit, which we accessed through the data dumps on Google's BigQuery using R (https://pushshift.io/using-bigquery-with-reddit-data/). The Pushshift API is a wrapper which allows large quantities of Reddit data to be accessed reliably and easily (Baumgartner et al., 2020; Gaffney and Matias, 2018). Our collection is consistent with Reddit's Terms of Service.
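For illustration, a Pushshift comment-search query for a given subreddit and time window can be constructed as follows. This is a hedged sketch: the parameter names follow Pushshift's public API documentation, the subreddit is a placeholder, and the paper's actual collection went through the Reddit data dumps on BigQuery rather than this endpoint:

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

PUSHSHIFT_COMMENTS = "https://api.pushshift.io/reddit/search/comment/"

def build_query(subreddit, after, before, size=100):
    """Build a Pushshift comment-search URL for one subreddit and a
    UTC epoch time window (hypothetical helper for illustration)."""
    params = {"subreddit": subreddit, "after": after,
              "before": before, "size": size}
    return PUSHSHIFT_COMMENTS + "?" + urlencode(params)

# Collection window used in the paper: 1 Feb 2019 to 31 Jul 2019 (UTC)
after = int(datetime(2019, 2, 1, tzinfo=timezone.utc).timestamp())
before = int(datetime(2019, 7, 31, 23, 59, 59, tzinfo=timezone.utc).timestamp())
print(build_query("AskReddit", after, before))
```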
Ethical approval This project was given ethical approval on 18th March 2019, before any research had started, by The Alan Turing Institute (submission C1903-053). Reddit can be considered a public space in that discussions are open and posts are aimed at a large audience; in this way it differs from a one-to-one or 'private' messaging service. When users sign up to Reddit, they consent to have their data made available to third parties, such as academics, and many users are aware of this and choose to use non-identifiable pseudonyms. Existing ethical guidance indicates that in this situation explicit consent is not required from each user (which is often infeasible), provided that harm to users is minimized at all times (Williams, 2019) and no 'real' quotes are attributed to them in the paper. We follow this guidance and do not provide any direct quotes; the examples given in Table 1 are synthetic. We also minimized how many entries we collected from each user so that each one comprises only a small part of the total dataset. At no point did any of the research team contact any Reddit users, minimizing the risk of causing them harm. Further, we decided not to review any profile information about the users, substantially minimizing the risk that any personally identifiable information is included in the dataset.

Treatment of annotators
We used trained annotators who were carefully recruited through the host institution (in line with its HR procedures); crowdsourced workers were not used. Annotators were closely supervised through weekly meetings and regular one-to-one discussions. We followed the guidelines provided by Vidgen et al. (2019a) for ensuring annotator welfare during the work, and provided annotators with access to support services throughout the project, including counselling, although these were not used. Annotators were paid substantially above the living wage, received paid holiday, and all meetings and training time were paid.
Research team wellbeing To protect the wellbeing of the research team, we held regular catch-up discussions and ensured that the lead researchers were not excessively exposed to harmful content. We did not post anything about the project while it was being conducted (to minimize the risk of attracting the attention of malicious online actors) and did not engage with any of the Reddit users or communities being studied.
Dataset information and quality We provide a Data Statement in the Appendix, following Bender and Friedman (2018), with full information about the dataset.
Baseline models We present baseline classification models in the paper. We have carefully considered how these models could be deployed and believe that deployment is highly unlikely given their current performance. There is a risk of bias in any dataset and its associated models, and we have sought to provide as much information as possible in our dataset, documentation and other artefacts to enable future researchers to investigate these issues. We do not use demographic or identity characteristics in the formation of the dataset. We also do not provide information about individual annotators, only the overall profile of the annotation team. The computational time/power involved in creating the baselines was minimal.
users appear in total.
D. ANNOTATOR DEMOGRAPHICS The dataset includes annotations from 12 trained analysts, recruited through a competitive process. They underwent 4 weeks of training, including numerous one-to-one sessions. Work was conducted over 12 weeks, with each annotator working between 10 and 20 hours per week. Of the 12 annotators who contributed to the final dataset, 11 consented to provide demographic information. Age: 7 annotators were 18-29, 3 were 30-39 and 1 was 40-49. Gender: 4 were female and 7 were male. Ethnicity: 8 were white, 1 was Latino, 1 was of Middle Eastern ethnic origin and 1 was mixed. National identity: 7 were British, 1 American, 1 Ecuadorean, 1 Jordanian and 1 Polish. Social media use: 9 used social media more than once per day, and 2 used it once per day. Exposure to online abuse: all annotators had witnessed online abuse in the previous year, with 10 stating they had witnessed it more than 3 times and 1 stating they had witnessed it 2-3 times. Disagreements were adjudicated through group discussion with an expert in abusive online content, a post-doctoral researcher with extensive experience.
E. SPEECH SITUATION All Reddit comments and posts were made between 1st February 2019 and 31st July 2019. The intended audience is unknown but was most likely the other members of the subreddit.
F. TEXT CHARACTERISTICS The composition of the dataset, including the distribution of the Primary and Secondary categories, is described in the paper.

B Data and Model fitting
B.1 Data
The application of the Context flag for the primary categories is shown in Table 7.