Detect All Abuse! Toward Universal Abusive Language Detection Models

Online abusive language detection (ALD) has become a societal issue of increasing importance in recent years. Several previous works in online ALD focused on solving a single abusive language problem in a single domain, like Twitter, and have not been successfully transferable to the general ALD task or domain. In this paper, we introduce a new generic ALD framework, MACAS, which is capable of addressing several types of ALD tasks across different domains. Our generic framework covers multi-aspect abusive language embeddings that represent the target and content aspects of abusive language and applies a textual graph embedding that analyses the user’s linguistic behaviour. Then, we propose and use the cross-attention gate flow mechanism to embrace multiple aspects of abusive language. Quantitative and qualitative evaluation results show that our ALD algorithm rivals or exceeds the six state-of-the-art ALD algorithms across seven ALD datasets covering multiple aspects of abusive language and different online community domains.


Introduction
Abusive language in online communities has become a significant societal problem (Nobata et al., 2016), and online abusive language detection (ALD) aims to identify any type of insult, vulgarity, or profanity that debases a target or group online. It is not limited to detecting offensive language (Razavi et al., 2010), cyberbullying (Xu et al., 2012), and hate speech (Djuric et al., 2015), but also covers more nebulous or implicit forms of abuse. Many social media companies and researchers have utilised multiple resources, including machine learning, human reviewers and lexicon-based text analytics, to detect abusive language (Waseem, 2016; Qian et al., 2018). However, none of them can perfectly resolve the ALD task because of the difficulties of moderating user content and of classifying ambiguous posts (Metz and Issac, 2019). On the technical side, previous ALD models were developed on only a few subtasks (e.g. hate speech, racism, sexism) in a single domain (like Twitter), and each specialised model is not successfully transferable to general ALD in different online communities.
Our research question is, "What would be the best generic ALD model that can be used for different types of abusive language detection sub-tasks and in different online communities?" To answer this, we build on Waseem et al. (2017), who reviewed the existing online abusive language detection literature and defined a generic abusive language typology that can encompass the targets of a wide range of abusive language subtasks in different types of domain. The typology is categorised into the following two aspects: 1) Target aspect: The abuse can be directed towards either a) a specific individual/entity or b) a generalised group. This is an essential sociological distinction as the latter refers to a whole category of people, like a race or gender, rather than a specific individual or organisation; 2) Content aspect: The abusive content can be explicit or implicit. Whether directed or generalised, explicit abuse is unambiguous in its potential to be damaging, while implicit abusive language does not immediately imply abuse (through the use of sarcasm, for example). For example, assume that we have a tweet "F***. You are sooo sweet like other girls". It includes all those aspects: the directed target ("yourself"), the generalised target ("girls"), the explicit content ("F***"), and the implicit content ("You are sooo sweet").

| Dataset | Source | Size | Composition |
|---|---|---|---|
| Waseem (Waseem and Hovy, 2016) | Twitter | 16.2k | Racism (11.97%), Sexism (19.43%), None (68.60%) |
| HatEval (Basile et al., 2019) | Twitter | 13k | Hateful (42.08%), Non-hateful (57.92%) |
| OffEval (Zampieri et al., 2019) | Twitter | 13.2k | Offensive (33.23%), Not-offensive (66.77%) |
| Davids | Twitter | 24.8k | Hate (5.77%), Offensive (77.43%), Neither (16.80%) |
| Founta (Founta et al., 2018) | Twitter | 99k | Abusive (27.15%), Hateful (4.97%), Normal (53.85%), Spam (4.97%) |
| FNUC (Gao and Huang, 2017) | Fox News discussion threads | 1.5k | Hateful (28.50%), Non-hateful (71.50%) |
| StormW (de Gibert et al., 2018) | Stormfront (forum) | 10.7k | Hate (10.93%), NoHate (89.07%) |

Table 1: Comparison and statistical analysis of the seven benchmark datasets evaluated in this paper. The composition column shows the classes and the class distribution in each dataset.

Inspired by this abusive language typology, we propose a new generic ALD framework, MACAS (Multi-Aspect Cross Attention Super Joint for ALD), using aspect models and a cross-attention aspect gate flow. First, we build four different types of abusive language aspect embeddings: directed target, generalised target, explicit content, and implicit content. We also propose to use a heterogeneous graph to analyse the linguistic behaviour of each author and learn word and document embeddings with graph convolutional networks (GCNs). Not every online community (e.g. news forums) allows user-to-user relationships (e.g. follower-following), so we avoid using user-community relationship information. Then, we propose a cross-attention aspect gate flow to obtain the mutual enhancement between the two aspects. The gate flow contains two gates, a target gate and a content gate, and fuses the outputs of those gates. The target gate draws on the content probability distribution, utilising the semantic information of the whole input sequence along with the target source, while the content gate takes in the target aspect probability distribution as supplementary information for content-based prediction. For evaluation, we test six state-of-the-art ALD models across seven datasets focused on different aspects and collected from different domains. Our proposed model rivals or exceeds those ALD methods on all of the evaluated datasets.
The contributions of the paper can be summarised as follows: 1) We perform a rigorous comparison of six state-of-the-art ALD models across seven ALD benchmark datasets, and find those models do not embrace different types of abusive language aspects in different online communities. 2) We propose a generic new ALD algorithm that enables explicit integration of multiple aspects of abusive language, and detection of generic abusive language behaviour in different domains. The proposed model rivals state-of-the-art algorithms on ALD benchmark datasets and performs best overall.

ALD Datasets
We briefly review the seven ALD benchmark datasets (Table 1), which were collected from different online community sources and focus on different class compositions. Waseem (Waseem and Hovy, 2016) is a Twitter ALD dataset covering the specific aspects of racism and sexism. The collected tweets were labelled as Racism, Sexism or None. HatEval (Basile et al., 2019) is a Twitter-based hate speech detection dataset released in SemEval-2019. It provides a general-level hate speech annotation, Hateful or Non-hateful, especially against immigrants and women. OffEval (Zampieri et al., 2019) covers the Twitter-based offensive language detection task in SemEval-2019. It is annotated as Offensive or Not-offensive, and includes insults, threats, and any form of untargeted profanity. Davids is a Twitter-based ALD dataset, which includes three classes, Hate, Offensive or Neither, based on the hate speech lexicon from Hatebase.org. Founta (Founta et al., 2018) is a large Twitter-based ALD dataset claimed to be annotated with high accuracy based on their proposed incremental and iterative annotation method. It is annotated with four classes: Hateful, Abusive, Normal or Spam. FNUC (Gao and Huang, 2017) is a hate speech detection dataset, which was collected from complete Fox News discussion threads and annotated with the general-level categories Hateful or Non-hateful. StormW (de Gibert et al., 2018) is a Stormfront-based hate speech detection dataset with the general-level labels Hate and NoHate. Stormfront is a supremacist forum where people promote white nationalism and antisemitism.

ALD Approaches
In the early stages, ALD was commonly addressed via hand-crafted rules and manual feature engineering. The first reported ALD work (Spertus, 1997) utilised a decision tree to detect hostile messages based on heuristic rules. Yin et al. (2009) and Razavi et al. (2010) added lexicon-based features together with semantic rules and designed a linear SVM and a Naïve Bayes classifier for detecting hostile language. Djuric et al. (2015) were the first to apply neural networks to ALD, using the paragraph2vec (Le and Mikolov, 2014) representation. Nobata et al. (2016) introduced a Yahoo! dataset and tested it with neural networks by applying a combination of word, character-based and syntactic features. Recently, deep learning techniques have become popular in ALD. Badjatiya et al. (2017) tested FastText/Glove, Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs) in detecting hate speech. Park and Fung (2017) designed a HybridCNN (word-level and character-level) model for abusive tweet detection in both one-step and two-step styles. Several works have applied bidirectional Gated Recurrent Unit (Bi-GRU) networks with Latent Topic Clustering (LTC) (Lee et al., 2018) and a transformer-based framework (Bugueño and Mendoza, 2019). Some works integrated user profiling into their ALD models. Qian et al. (2018) utilised a bi-LSTM to model the historical behaviour of users to generate inter-user and intra-user representations. Mishra et al. (2018) applied node2vec (Grover and Leskovec, 2016) to the constructed community graph of users to derive the user embedding. However, a user profiling-based approach is only possible when the user profiles are public and when the domain provides the user-community relation information.

The MACAS ALD Model
We propose the Multi-Aspect Cross Attention Super Joint (MACAS) model for ALD. It is designed as a generic ALD model that can embrace different types of abusive language aspects in different online communities. As shown in Figure 1, MACAS can be divided into three main phases:
1) Multi-aspect feature embedding [Sec. 3.1]. The Multi-Aspect Embedding Layer represents an understanding of the multiple aspects of abusive language for detecting generic abusive language behaviours. We focus on two main aspects, target and content, and each aspect has two sub-aspects. The target aspect represents abuse directed towards either a) a specific individual/entity or b) a generalised group (e.g. gender or race). The content aspect covers a) explicit or b) implicit abuse. Explicit abuse is unambiguous in its potential to be damaging, while implicit abusive language does not immediately imply abuse (e.g. sarcasm). In addition, if the platform provides users' historical posts, we apply Graph Convolutional Networks (GCNs) to build a word-document graph embedding that represents the linguistic behaviours of users. Not every online community (e.g. news forums) has user-to-user relationships (e.g. follower-following), so we avoid using user-community relationship and community network information.
2) Cross-Attention Gate Flow for integrating multiple aspects [Sec. 3.2]. The Cross-Attention gate produces the joint integration of the target aspect and content aspect models and obtains the mutual enhancement between the two aspects. This produces well-integrated multi-aspect representations and improves the performance of generic ALD.
3) Final aggregation of learned ALD embeddings [Sec. 3.3]. We aggregate the multi-aspect embeddings and the user's linguistic behaviour embedding across the online post using convolutional neural networks, and produce the ALD prediction using a multi-layer perceptron.

Target: Directed Abuse Embedding
Directed abuse is abuse towards a specific individual or entity (Waseem et al., 2017). To model this aspect, a named entity recognition (NER) approach is used. To train the NER model, we apply stacked bi-directional LSTMs, which are one of the state-of-the-art models (Chiu and Nichols, 2016). We extract the vector before the final softmax layer of the NER model and use it as the Directed Abuse Embedding.

Target: Generalised Abuse Embedding
Generalised abuse tends to target people belonging to a small set of categories, primarily gender. The gender debiasing embedding (Kaneko and Bollegala, 2019) is applied. The vocabulary set ($V$) is split into four mutually exclusive sets of words, namely, masculine ($V_m$), feminine ($V_f$), neutral ($V_n$) and stereotypical ($V_s$). Each word is represented by a vector which is calculated by minimising a loss function to satisfy the following criteria: 1) protect the feminine information for words in $V_f$; 2) protect the masculine information for words in $V_m$; 3) protect the neutrality for words in $V_n$; 4) remove gender biases for words in $V_s$.

Content: Explicit Abuse Embedding
Whether the target is directed or generalised, explicit abuse is usually indicated by specific keywords from the homophobic slurs lexicon. We used dict2vec (Tissier et al., 2017), which aims to learn word embeddings based on natural language dictionaries. In this paper, the model is trained on the Cambridge, Collins, Oxford and dictionary.com dictionaries, to which we add an abusive language lexicon. This approach first defines strong pairs and weak pairs of words. If both words appear in each other's definition, the word pair is defined as a strong pair. If only one word appears in the other's definition, the word pair is defined as a weak pair. If neither word appears in the other's definition, the words are considered unrelated. Each word is represented by a vector. Strongly paired words have more similar vectors than weakly paired words, which in turn have more similar vectors than unrelated words.
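The strong/weak pair rule described above can be illustrated with a minimal Python sketch. The function name `classify_pair` and the toy dictionary entries are hypothetical and chosen only for illustration; the sketch covers the pair definition as stated in this section, not the dict2vec training procedure itself.

```python
def classify_pair(word_a, word_b, definitions):
    """Label a word pair using the rule described in the text.

    `definitions` maps each headword to the set of words in its definition.
    """
    a_in_b = word_a in definitions.get(word_b, set())
    b_in_a = word_b in definitions.get(word_a, set())
    if a_in_b and b_in_a:
        return "strong"      # each word appears in the other's definition
    if a_in_b or b_in_a:
        return "weak"        # only one word appears in the other's definition
    return "unrelated"       # neither definition mentions the other word


# Toy dictionary with made-up entries (an assumption for this example only).
toy_definitions = {
    "car": {"vehicle", "road", "wheels"},
    "vehicle": {"machine", "car", "transport"},
    "wheels": {"circular", "object"},
    "banana": {"fruit"},
}

print(classify_pair("car", "vehicle", toy_definitions))
print(classify_pair("car", "wheels", toy_definitions))
print(classify_pair("car", "banana", toy_definitions))
```

During training, pairs labelled strong would be pulled closer together than pairs labelled weak, which matches the ordering of vector similarities stated above.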

Content: Implicit Abuse Embedding
Implicit abusive language does not immediately imply or denote abuse, similar to sarcasm. Here we use a hybrid of CNN and LSTM-based sarcasm detection models (Ghosh and Veale, 2016). The vector before the final softmax layer of the sarcasm detection model is the Implicit Abuse Embedding.

Additional: User Linguistic Behaviour Embedding
We model the graph by setting each comment in the training set as a document. The vocabulary is the set of all words in the documents. The corpus is the collection of all documents. The nodes of our graph are the union of the documents and the vocabulary. An edge of weight 1 exists between each node and itself. An edge exists between a document and a word if the word is in that document. The edge is weighted with the TF-IDF for the (document, word) pair, within the corpus. An edge exists between two words if they have a non-negative point-wise mutual information (PMI) with a sliding window size of 20, within the corpus. The weight for the edge is the PMI for the word pair. The edge weights are compiled into an adjacency matrix, combined with the graph's degree matrix and passed into a two-layer GCN trained to map each document to its user as a label. For datasets without user ids, we use the actual classification target as the document node label. From this network, we obtain an embedding for each node, that is, for each document and each word. The trained word embeddings $G_e$ are fed into transformer encoders to get the linguistic behaviour outputs.
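The edge construction described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the paper: the function name `build_text_graph` is hypothetical, the TF-IDF variant (relative term frequency with a smoothed IDF) is an assumption because the text does not specify one, and PMI is estimated from sliding-window co-occurrence counts.

```python
import math
from collections import Counter
from itertools import combinations


def build_text_graph(docs, window_size=20):
    """Return {(node_a, node_b): weight} for a word-document graph.

    `docs` is a list of token lists. Nodes are ("doc", index) or ("word", token).
    Each undirected edge is stored once.
    """
    edges = {}
    vocab = sorted({w for d in docs for w in d})
    n_docs = len(docs)

    # Self-loops of weight 1 on every node.
    for i in range(n_docs):
        edges[(("doc", i), ("doc", i))] = 1.0
    for w in vocab:
        edges[(("word", w), ("word", w))] = 1.0

    # Document-word edges weighted by TF-IDF (smoothed IDF is an assumption).
    doc_freq = Counter(w for d in docs for w in set(d))
    for i, d in enumerate(docs):
        for w, count in Counter(d).items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[w])) + 1
            edges[(("doc", i), ("word", w))] = (count / len(d)) * idf

    # Sliding windows over each document for co-occurrence statistics.
    windows = []
    for d in docs:
        if len(d) <= window_size:
            windows.append(d)
        else:
            windows.extend(d[s:s + window_size]
                           for s in range(len(d) - window_size + 1))
    n_win = len(windows)
    word_win, pair_win = Counter(), Counter()
    for win in windows:
        uniq = sorted(set(win))
        word_win.update(uniq)
        pair_win.update(combinations(uniq, 2))

    # Word-word edges: keep pairs with non-negative PMI, weighted by the PMI.
    for (a, b), c_ab in pair_win.items():
        pmi = math.log(c_ab * n_win / (word_win[a] * word_win[b]))
        if pmi >= 0:
            edges[(("word", a), ("word", b))] = pmi
    return edges


toy_docs = [["you", "are", "sweet"], ["you", "are", "rude"], ["nice", "day"]]
graph = build_text_graph(toy_docs)
print(len(graph), "edges")
```

The resulting edge dictionary can be converted to the adjacency matrix that the GCN consumes; word pairs that never share a window simply have no edge.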

Cross-Attention Gate Flow
In the Cross-Attention Gate Flow, we first use a cross transformer encoder to refine our four types of embedding: directed abuse embedding $D$, generalised abuse embedding $G$, explicit abuse embedding $E$ and implicit abuse embedding $I$. Before feeding them into the cross transformer encoders, we combine $D$ with $G$ as the target embedding $T_e$, broadcast $I$ to the sequence length $N$, and then combine it with $E$ as the content embedding $C_e$. Normally, for the transformer encoder (Vaswani et al., 2017), the attention is calculated using the key ($K$, of dimension $d_k$), query ($Q$) and value ($V$):
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
However, to produce the joint integration of the target aspect model and the content aspect model, we apply the cross-transformer to $T_e$ and $C_e$. As shown in Figure 2, each transformer encoder has its own $K$, $Q$, $V$ for $T_e$ and $C_e$. The $K$, $V$ of $T_e$ and $C_e$ are switched, which means that the $K$, $V$ of $T_e$ go to the transformer encoder of $C_e$ and the $K$, $V$ of $C_e$ go to the encoder of $T_e$. The attention is then calculated by
$$\mathrm{Attention}_T = \mathrm{softmax}\left(\frac{Q_T K_C^\top}{\sqrt{d_k}}\right) V_C, \qquad \mathrm{Attention}_C = \mathrm{softmax}\left(\frac{Q_C K_T^\top}{\sqrt{d_k}}\right) V_T$$
where the subscripts $T$ and $C$ denote the projections of $T_e$ and $C_e$ respectively. We call this cross transformer Cross at the Beginning (CB). Similar to the original transformer encoder, each encoder contains one or more encoder stacks, each of which mainly consists of two sub-layers: a multi-head attention layer and a fully connected feed-forward neural network (FNN). A residual connection followed by layer normalisation is employed around each of the two sub-layers before feeding to the next sub-layer. Another way to produce the joint integration occurs before the FNN layer. The output of the multi-head attention is the input to the FNN layer, after which an Add & Norm layer is applied. Normally, the output of the transformer encoder is calculated by
$$\mathrm{Output} = \mathrm{LayerNorm}(M + \mathrm{FNN}(M))$$
where $M$ is the output of the multi-head attention sub-layer. The input to the FNN can also be switched between content and target, which is called Cross in the Middle (CM); in this case the FNN of each encoder receives the multi-head attention output of the other encoder before the Add & Norm layer. If the cross happens both at the beginning and in the middle, the structure is called Cross at the Beginning and in the Middle (CBM).
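To make the Cross at the Beginning (CB) exchange concrete, the following is a minimal single-head sketch in plain Python. It is an illustration under assumptions rather than the paper's implementation: the helper names and the projection dictionaries `proj_t` and `proj_c` are hypothetical, and multi-head splitting, residual connections, layer normalisation and the FNN are omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)


def cross_attention_at_beginning(t_e, c_e, proj_t, proj_c):
    """CB: each encoder keeps its own queries but uses the other's K and V."""
    q_t, k_t, v_t = (matmul(t_e, proj_t[name]) for name in "qkv")
    q_c, k_c, v_c = (matmul(c_e, proj_c[name]) for name in "qkv")
    att_t = attention(q_t, k_c, v_c)  # target queries attend to content
    att_c = attention(q_c, k_t, v_t)  # content queries attend to target
    return att_t, att_c


# Toy inputs with N = 2 tokens and D_e = 2; identity projections for clarity.
identity = [[1.0, 0.0], [0.0, 1.0]]
proj = {"q": identity, "k": identity, "v": identity}
t_e = [[1.0, 0.0], [0.0, 1.0]]
c_e = [[2.0, 0.0], [0.0, 2.0]]
att_t, att_c = cross_attention_at_beginning(t_e, c_e, proj, proj)
print(att_t)
print(att_c)
```

Because the attention weights in each row sum to one, every row of `att_t` is a convex combination of content value vectors and every row of `att_c` is a convex combination of target value vectors, which is the sense in which each aspect queries the other.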
The comparison of different cross transformer structures will be discussed in 5.2. Both of the input embeddings $T_e$ and $C_e$ are of shape $[N, D_e]$, where $D_e$ is the sum of the dimensions of the concatenated embeddings. The transformer encoders output $T_h$ and $C_h$ in the same shape $[N, D_e]$. The encoder hidden states $T_h$ from $T_e$ and $C_h$ from $C_e$ are used to compute the initial abusive language probability, which is the major input of our bi-directional aspect gate flow.
On top of the cross-attention, we introduce the Bi-directional Aspect Gate Flow, which contains two gates: the content gate and the target gate. Denote the input sequences to our gates from the previous-layer encoders as $T_h \in \mathbb{R}^{N \times D_T}$ and $C_h \in \mathbb{R}^{N \times D_C}$, where $N$ is the sequence length and $D_T$ and $D_C$ are the dimensions of the target embedding and the content embedding respectively. In the content gate, we first flatten $T_h$ to $T_{hf} \in \mathbb{R}^{1 \times (N \cdot D_T)}$. We then pass $T_{hf}$ through a dense layer and apply the softmax function. The resultant $P_{Th}$ is a $D$-dimensional probability vector, where $D = N_{cls}$ is the number of distinct labels to classify, $W_C \in \mathbb{R}^{D \times D_C}$ is the weight matrix and $b_C \in \mathbb{R}^{1 \times D}$ is the bias vector. Then we broadcast $P_{Th}$ over the $N$ tokens, which yields $\tilde{P}_{Th} \in \mathbb{R}^{N \times D}$. We then concatenate $\tilde{P}_{Th}$ with the transformer encoder output state $C_h$ from the content source, generating the augmented content state $O_C \in \mathbb{R}^{N \times (D + D_C)}$. Finally, we flatten $O_C$ and pass it to a dense layer, producing the output matrix $P_C \in \mathbb{R}^{1 \times D}$.
The procedure in the target gate is almost the same as in the content gate. Here we flatten the input sequence $C_h$, generating the flattened output $C_{hf} \in \mathbb{R}^{1 \times (N \cdot D_C)}$. We then pass the result through a dense layer and apply the softmax function. The resultant $P_{Ch}$ is also broadcast to $\tilde{P}_{Ch}$ and then concatenated with the target encoder output state $T_h$, giving the augmented target state $O_T \in \mathbb{R}^{N \times (D + D_T)}$. Finally, $O_T$ is also flattened and passed to a dense layer, which produces the output matrix $P_T \in \mathbb{R}^{1 \times D}$.
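The two gates follow the same sequence of steps, which can be sketched with one function in plain Python. This is a hedged illustration, not the authors' code: the name `aspect_gate`, the random toy weights, and the weight shapes (chosen so that the flattened inputs can be multiplied directly) are assumptions made for this example.

```python
import math
import random


def softmax(vec):
    mx = max(vec)
    exps = [math.exp(v - mx) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]


def dense(vec, weights, bias):
    """Affine layer: `vec` has length m, `weights` is m x n, `bias` has length n."""
    return [sum(x * w_row[j] for x, w_row in zip(vec, weights)) + bias[j]
            for j in range(len(bias))]


def aspect_gate(other_h, own_h, w_prob, b_prob, w_out, b_out):
    """One gate of the flow.

    1. Flatten the other aspect's hidden state and map it to class probabilities.
    2. Broadcast those probabilities over the N tokens and concatenate them
       with this aspect's own hidden state (the augmented state).
    3. Flatten the augmented state and map it to a 1 x D output.
    """
    flat_other = [v for row in other_h for v in row]
    p_other = softmax(dense(flat_other, w_prob, b_prob))
    augmented = [list(p_other) + list(row) for row in own_h]
    flat_aug = [v for row in augmented for v in row]
    return dense(flat_aug, w_out, b_out), p_other


# Toy sizes (hypothetical): N tokens, target/content dimensions, D classes.
rng = random.Random(0)
N, D_T, D_C, D = 2, 2, 3, 2


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


t_h, c_h = rand_matrix(N, D_T), rand_matrix(N, D_C)

# Content gate: target probabilities augment the content state.
p_c, p_th = aspect_gate(t_h, c_h, rand_matrix(N * D_T, D), [0.0] * D,
                        rand_matrix(N * (D + D_C), D), [0.0] * D)
# Target gate: content probabilities augment the target state.
p_t, p_ch = aspect_gate(c_h, t_h, rand_matrix(N * D_C, D), [0.0] * D,
                        rand_matrix(N * (D + D_T), D), [0.0] * D)
print(p_c, p_t)
```

Calling the same function with the roles of the two hidden states swapped mirrors the statement that the target gate is almost the same as the content gate.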

Final Fusion
We propose a hierarchical fusion, which fuses the linguistic behaviour outputs ($P_G$) with the content gate output ($P_C$) and the target gate output ($P_T$) respectively, and uses two CNNs to integrate each fusion to obtain $C_C$ and $C_T$. We then concatenate $C_C$ and $C_T$ and flatten the result to $F_F$. Finally, a multi-layer perceptron (MLP) with three stacked layers is used for the final prediction:
$$Z = \mathrm{softmax}\left(W_3\,\mathrm{ReLU}\left(W_2\,\mathrm{ReLU}\left(W_1 F_F + b_1\right) + b_2\right) + b_3\right)$$
For each layer $i$, $W_i$ and $b_i$ represent the weight matrix and bias vector, and the ReLU activation function is used for the first two layers. For the last layer, a softmax layer is used to get the probability of each class $Z$.

Performance Comparison
In this part, we compare our model with six baseline models over all seven datasets, discussed in Sec 2.1. These baseline models are constructed with various word representations as well as different neural networks or classifiers. Table 2 presents the weighted average f1 performance of each baseline model and our model over each dataset. Our model outperforms the baseline models on all seven datasets.
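For readers unfamiliar with the metric, the weighted average f1 reported here is commonly computed as the per-class f1 score averaged with weights proportional to each class's share of the true labels. The sketch below shows that standard definition in plain Python; the function name and the toy labels are hypothetical, and we assume this is the definition the evaluation uses.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class f1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_label in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_label - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_label / total) * f1
    return score


# Toy labels (made up for illustration): 2 Hate and 4 NoHate examples.
y_true = ["Hate", "Hate", "NoHate", "NoHate", "NoHate", "NoHate"]
y_pred = ["Hate", "NoHate", "NoHate", "NoHate", "NoHate", "Hate"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.6667
```

In this toy case the Hate class has f1 = 0.5 and the NoHate class has f1 = 0.75, so the weighted average is (2/6)(0.5) + (4/6)(0.75) = 2/3. Weighting by support matters for imbalanced datasets such as StormW, where the majority class dominates the score.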
Applying multiple aspect embeddings enables our model to process the texts from multi-perspective views. The Cross-Attention gate flow makes it possible to obtain the mutual enhancement between the two different aspects. Although some of the baseline models such as OTH, MFR also combine two embedding approaches (Chars2vec and Glove) to get more information, they still just consider the general information of the texts rather than extract information in a targeted fashion from various aspects. For these reasons our model can achieve performance above the baseline models.
As well as comparing our model with the baseline models, we also make some observations from comparing the six baseline models amongst themselves. Firstly, OTH and MFR use the combined embeddings of Chars2vec and Glove, which gives more information. So, they can achieve relatively better weighted average f1 scores compared to most other baseline models, which just use a single embedding method. Secondly, the results of TWL and LTC indicate that the bi-directional recurrent neural network leads to better performance than the simple forward recurrent neural network. This means that not only the past states but also the future ones affect the prediction results. Thirdly, although we may not consider TF-IDF with SVM to be as good as Chars2vec or Glove with deep neural networks, the TIS baseline model never gets the worst weighted f1 score on any of the seven datasets when compared with the other models. In fact, it even outperforms the other baseline models on Waseem and Founta. For both datasets, there might be some particular words which are highly significant for identifying the class, so TF-IDF can achieve good results on these two datasets.
Next, we analyse examples to show how different models perform on different datasets; the results are shown in Table 3. The first example is from Waseem, "Really bitch really? Get on the runway and back in the kitchen.", which should be predicted as Sexism. It is quite explicit in that the word "bitch" is in this sentence, and this makes TIS predict it as Sexism easily since TF-IDF focuses on word occurrence. Besides, "back in the kitchen" is implicit Sexism, implying women should be in the kitchen. A similar pattern can be found in the second instance, "thegeek chick please answer", which explicitly mentions the word 'chick'. The third and fourth samples represent abusive language or hate speech about the topic of Feminism. The third explicitly states the words 'Feminism' and 'Awful', and TIS and our model successfully detect the abuse by identifying the explicit hate speech aspect. Our model, which considers the explicit and implicit aspects, can predict the sentence as Sexism easily. Another example is from FNUC, "BLM is a Hate group Given Birth by Obama's comments That put a target on the backs of police officers", which should be Hateful. This comment insults "Black Lives Matter" by calling it a hate group. Normally, describing something as a hate group is not hate speech, but in this case, calling BLM a hate group is racism. This is not easy for the baseline models to spot, and only our model predicts it correctly.

Table 4: Abusive language detection results across seven benchmark datasets for MACAS with two cross-attention aspect gate flow mechanisms and graph embedding. We highlight the top 2 settings for each dataset. The darker the colour, the better the performance. The comparison provides different parameters (N) of final fusion layers, including N=1 or 3. (CB: cross-attention at the beginning, CBM: cross-attention at the beginning and the middle, G: the user linguistic behaviour graph embedding)

For the last example from StormW, "i am a white nationalist of a christian faith but still am a white nationalist for racial survival the anti racist christians are the true chirstians enemy", the user described himself as "white nationalist" which is one kind of hate speech, and OTH can predict this sentence as Hate. The reason is that the CNN used in OTH can capture the information for phrases, which is the "white nationalist" here. Besides, our model can predict this sentence correctly since the sentence is a general explicit hate speech.

Ablation Testing -Cross-attention gate flow
In this part, three different structures of cross transformer encoders are tested: 1) Cross-transformer at the beginning of the transformer encoder (CB): exchanging content's and target's K and V at the beginning of the transformer encoders as in Figure 2; 2) Cross-transformer in the middle of the transformer encoder (CM): exchanging content's and target's input for the Feed Forward layer in the transformer encoder, which is in the middle of the transformer encoders; 3) Cross-transformer at both places (CBM): the combination of CB and CM. Due to the poor performance of CM, only results for the seven datasets with the CB and CBM structures are shown in Table 4. Besides, to find whether and how GCN improves the performance of our model, different structures are also compared: 1) Model without GCN; 2) Model with GCN using hierarchical fusion, repeated one or three times. We show one and three times here because, on all the datasets, our model achieves the best performance with one or three repeated fusions when GCN is also used. Two conclusions are drawn based on the results of CB and CBM. Firstly, the best model is always the CB model, and the second best is always the CBM model with the same GCN structure. So, comparing the CB and CBM structures, CB has better performance and we use this structure as our final model. Besides, in most cases, CB outperforms CBM if they share the same GCN structure, which also shows that, overall, CBM is worse than CB. Considering the fact that CM is the worst, we can say that crossing in the middle of the transformer encoder lowers the model performance. Exchanging content's and target's K, V is important since it allows the target aspects to query on the content aspects and vice versa. However, exchanging values before the Feed Forward layer only gives a different add and norm, which doesn't usefully increase the interaction between content aspects and target aspects.
Secondly, our model performs better with GCN when the dataset contains user ids. Not all the datasets provide user ids, and as mentioned in Sec 3.1, the User Linguistic Behaviour embedding is trained by using the user id as the target. For those datasets without user ids, the real abusive labels are used as the training target. By comparison, we find that Waseem, StormW, and FNUC, which provide user ids, perform better using a model with GCN, while the other four datasets, which don't provide user ids, perform better using a model without GCN. Therefore, for datasets with user ids, the User Linguistic Behaviour embedding from GCN can improve the performance of our model, and for datasets without user ids, the model structure without GCN is recommended.

Table 5: Ablation studies comparing different types of multi-aspect integration for the generic ALD model. In the proposed model, MACAS, we introduced four aspect embeddings, including directed abuse (D), generalised abuse (G), explicit abuse (E), and implicit abuse (I). Directed and generalised abuse belong to the target aspect group, while explicit and implicit abuse belong to the content aspect group. The ablation testing is conducted on different combinations of aspect embeddings from each higher-level aspect group. The highest performance is highlighted in green, the lowest is marked in red.

Ablation Testing -Multi-aspect embedding
To check how aspect embeddings contribute to the model, an ablation test on different combinations of the embeddings is conducted on all seven datasets. We use the CB model without GCN for the prediction. Table 5 presents the weighted average f1 scores for 9 different combinations of four aspect embedding models, including Directed abuse D, Generalised abuse G, Explicit abuse E, and Implicit abuse I. Each combination must include at least one embedding from the target aspect and at least one from the content aspect. For Waseem, the D + G + E + I combination achieves the best performance with a weighted average f1 score of 82.35, and most other combinations have a slightly lower performance. In contrast, D + I gets the worst weighted f1 score of 61.93. The reason why D + I is much worse than other combinations may lie in two facts: 1) In this dataset, abusive language is generally explicit rather than aimed at a specific target in an implicit way. 2) Even humans cannot easily distinguish directed abuse expressed in an implicit way, and it can be very difficult for the annotators to annotate the label correctly. Besides, the D + G + E + I combination outperforms other cases because it takes all the aspects into consideration. Similar results occur on the other Twitter datasets, Davids, HatEval, OffEval and Founta: D + G + E + I achieves the best performance while D + I is much worse. For FNUC, due to the small volume of the dataset and its imbalanced labels, not all the combinations have a good prediction result. D + G + E having the best performance implies that the dataset doesn't have a large number of implicit abuse samples. For StormW, D + G + E + I gets the best performance. Besides, G + E also has a good performance. The reason is that this dataset is collected from a racism forum and most hate speech on that website is generally abusive in an explicit way.
Based on the analysis of the different embedding combinations on these datasets, we can conclude that the embeddings used may vary based on different kinds of datasets, but combining them all is always a good idea. Although four specific different embeddings are selected in our model to represent four different aspects, other kinds of embeddings could also be used as long as they can represent the corresponding aspects.
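The count of nine ablation settings follows from requiring at least one embedding from each aspect group: three non-empty subsets of {D, G} times three non-empty subsets of {E, I}. A short Python sketch enumerates them; the variable names are hypothetical and the ordering is not meant to match Table 5.

```python
from itertools import combinations


def nonempty_subsets(items):
    """All non-empty subsets of `items`, preserving the original order."""
    return [c for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


TARGET = ("D", "G")    # directed, generalised
CONTENT = ("E", "I")   # explicit, implicit

# Pair every non-empty target subset with every non-empty content subset.
ablation_settings = [" + ".join(t + c)
                     for t in nonempty_subsets(TARGET)
                     for c in nonempty_subsets(CONTENT)]

for setting in ablation_settings:
    print(setting)
```

The list includes both the full D + G + E + I model and the D + I setting discussed in this section.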

Conclusion
Abusive language detection is an essential but challenging task, and it is almost impossible to successfully encompass all different abusive language tasks in different domains. The evaluation also shows that most of the state-of-the-art ALD algorithms do not generalise their model to different types of abusive language problems or datasets. In this paper, we proposed a new generic abusive language model, called MACAS, which applied multi-aspect embeddings to represent generalised characteristics of the domain and introduced a cross-attention gate flow model to achieve better performance by mutual enhancement between the target aspect and the content aspect. The results indicate that our framework was successful and effective in capturing abusive language aspects in different domains. Compared to other ALD models, our model successfully works in general abusive language detection, and it is hoped that MACAS provides some insight into the future direction of generic abusive language detection.