HABERTOR: An Efficient and Effective Deep Hatespeech Detector

We present our HABERTOR model for detecting hatespeech in large-scale user-generated content. Inspired by the recent success of the BERT model, we propose several modifications to BERT to enhance performance on the downstream hatespeech classification task. HABERTOR inherits BERT's architecture, but differs in four aspects: (i) it generates its own vocabularies and is pre-trained from scratch using the largest-scale hatespeech dataset; (ii) it consists of Quaternion-based factorized components, resulting in a much smaller number of parameters, faster training and inferencing, and less memory usage; (iii) it uses our proposed multi-source ensemble heads with a pooling layer for separate input sources, to further enhance its effectiveness; and (iv) it uses regularized adversarial training with our proposed fine-grained and adaptive noise magnitude to enhance its robustness. Through experiments on a large-scale real-world hatespeech dataset with 1.4M annotated comments, we show that HABERTOR outperforms 15 state-of-the-art hatespeech detection methods, including fine-tuned Language Models. In particular, compared with BERT, our HABERTOR is 4~5 times faster in the training/inferencing phase, uses less than 1/3 of the memory, and has better performance, even though we pre-train it using less than 1% of the number of words. Our generalizability analysis shows that HABERTOR transfers well to other unseen hatespeech datasets and is a more efficient and effective alternative to BERT for hatespeech classification.


Introduction
The occurrence of hatespeech has been increasing (Barna, 2019). Social media makes it easier than ever to reach a large audience quickly, increasing the temptation for inappropriate behaviors such as hatespeech, and the potential damage to social systems. In particular, hatespeech interferes with civil discourse and turns good people away. Furthermore, hatespeech in the virtual world can lead to physical violence against certain groups in the real world, so it should not be ignored on the grounds of freedom of speech.
Recently, the BERT (Bidirectional Encoder Representations from Transformers) model (Devlin et al., 2019) has achieved tremendous success in Natural Language Processing. The key innovation of BERT is in applying the transformer (Vaswani et al., 2017) to language modeling tasks. A BERT model pre-trained on these language modeling tasks forms a good basis for further fine-tuning on supervised tasks such as machine translation and question answering.
Recent work on hatespeech detection (Nikolov and Radivchev, 2019) has applied the BERT model and has shown its prominent results over previous hatespeech classifiers. However, we point out two of its limitations in the hatespeech detection domain. First, previous studies (ElSherief et al., 2018b,a) have shown that a hateful corpus exhibits distinct linguistic/semantic characteristics compared to a non-hateful corpus. For instance, hatespeech sequences are often informal or even intentionally misspelled (ElSherief et al., 2018a; Arango et al., 2019), so words in hateful sequences can sit in a long tail when ranked by their uniqueness, and a comment can be hateful or non-hateful using the same words (Zhang and Luo, 2019). For example, "dick" in the sentence "Nobody knew dick about what that meant" is non-hateful, but "d1ck" in "You are a weak small-d1cked keyboard warrior" is hateful. Thus, to better understand hateful vocabularies and contexts, it is better to pre-train on a mixture of both hateful and non-hateful corpora. Doing so helps to overcome the limitation of using BERT models pre-trained on non-hateful corpora like English Wikipedia and BookCorpus. Second, even the smallest pre-trained BERT "base" model contains 110M parameters, and takes a lot of computational resources to pre-train, fine-tune, and serve. Some recent efforts aim to reduce the complexity of the BERT model with the knowledge distillation technique, such as DistilBERT (Sanh et al., 2019) and TinyBERT (Jiao et al., 2019). In these methods, a pre-trained BERT-like model is used as a teacher model, and a student (smaller) model (e.g., TinyBERT, DistilBERT) is trained to produce output similar to that of the teacher model. Unfortunately, while their complexity is reduced, their performance on NLP tasks is also degraded compared to BERT. Another direction is to use cross-layer parameter sharing, as in ALBERT (Lan et al., 2020).
However, ALBERT's computational time is similar to BERT, since the number of layers remains the same as BERT; likewise, its inference is equally expensive.
Based on the above observations and analysis, we aim to investigate whether it is possible to achieve better hatespeech prediction performance than state-of-the-art machine learning classifiers, including classifiers based on the publicly available BERT model, while significantly reducing the number of parameters compared with the BERT model. We believe that performing the pre-training tasks from the ground up on a hatespeech-related corpus allows the model to understand hatespeech patterns better and enhances the predictive results. However, while language model pretraining tasks require a large-scale corpus, available hatespeech datasets are normally small: only 16K∼115K annotated comments (Waseem and Hovy, 2016; Wulczyn et al., 2017). Thus, we introduce a large annotated hatespeech dataset with 1.4M comments extracted from Yahoo News and Yahoo Finance. To reduce the complexity, we reduce the number of layers and the hidden size, and propose Quaternion-based factorization mechanisms within the BERT architecture. To further improve the model's effectiveness and robustness, we introduce a multi-source ensemble-head fine-tuning architecture, as well as target-based adversarial training.
The major contributions of our work are:
• We reduce the number of parameters in BERT considerably, and consequently the training/inferencing time and memory usage, while achieving better performance compared to the much larger BERT models and other state-of-the-art hatespeech detection methods.
• We pre-train from the ground up a hateful language model with our proposed Quaternion Factorization methods on a large-scale hatespeech dataset, which gives better performance than fine-tuning a pretrained BERT model.
• We propose a flexible classification net with multiple sources and multiple heads, built on top of the learned sequence representations, to further enhance our model's predictive capability.
• We utilize adversarial training with a proposed fine-grained and adaptive noise magnitude to improve our model's performance.

Related Work
Some of the earlier works in hatespeech detection applied a variety of classical machine learning algorithms (Chatzakou et al., 2017; Davidson et al., 2017; Waseem and Hovy, 2016; MacAvaney et al., 2019). Their intuition is to do feature engineering (i.e., manually generate features), then apply classification methods such as SVM, Random Forest, and Logistic Regression. The features are mostly Term-Frequency Inverse-Document-Frequency scores or Bag-of-Words vectors, and can be combined with additional features extracted from the user account's meta information and network structure (i.e., followers, followees, etc.). These methods are suboptimal as they mainly rely on the quality and quantity of the human-crafted features. Recent works have used deep neural network architectures for hatespeech detection (Zampieri et al., 2019; Mou et al., 2020), such as CNNs (Gambäck and Sikdar, 2017; Park and Fung, 2017), RNNs (i.e., LSTM and GRU) (Badjatiya et al., 2017; Agrawal and Awekar, 2018), combinations of CNNs with RNNs (Zhang et al., 2018), or fine-tuning pretrained language models (Indurthi et al., 2019).
Unlike previous works, we pre-train a hateful language model, then build a multi-source multi-head hatespeech classifier with regularized adversarial training to enhance the model's performance.

Problem Definition
Given an input text sequence s = [w_1, w_2, ..., w_n], where w_1, ..., w_n are words and n = |s| is the maximum length of the input sequence s, the hatespeech classification task aims to build a mapping function f : s −→ [0, 1] that takes s and returns a probability score P(y = 1|s) ∈ [0, 1], indicating how likely s is to be classified as hatespeech. In this paper, we approximate f by a deep neural classifier: we first pretrain f with unsupervised language modeling tasks to enhance its language understanding, then train f on the hatespeech classification task to produce P(y = 1|s).

Tokenization
The BERT model relies on WordPiece (WP) (Wu et al., 2016), Google's internal code that breaks down each word into common sub-word units ("wordpieces"). These sub-words are like character n-grams, except that they are chosen automatically to ensure that each sub-word is frequently observed in the input corpus. WP improves the handling of rare words, such as intentionally misspelled abusive words, without the need for a huge vocabulary. A comparable open-source implementation is SentencePiece (SP) (Kudo and Richardson, 2018). Like WP, its vocabulary size is predetermined. Both WP and SP are unsupervised learning models. Since WP is not publicly released, we train an SP model using our training data, then use it to tokenize input texts.

(Figure 1b: HABERTOR with two sources and an ensemble of 2 heads; each source has its own pooled classification heads, producing P(y | s, "news") and P(y | s, "finance").)

For the output part, we use Quaternion transformations from 4H to I, then from I to H. This results in (HI + IH/4) parameters, compared to 4H^2 in BERT. When we apply all the above compression techniques together, the total number of parameters is reduced to VE + EH/4 + L(3CH/4 + H^2 + 5HI/2). Particularly, with the BERT-base settings of V=32k, H=768, L=12, if we set E=128, C=192, and I=128, the total number of parameters is reduced from 110M to only 8.4M.
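As a rough illustration of how sub-word tokenization copes with intentionally misspelled words, the sketch below implements WordPiece-style greedy longest-match decoding over a tiny hand-made vocabulary. The vocabulary and example words here are hypothetical: real WP/SP vocabularies are learned from corpus statistics, and SentencePiece additionally learns its own segmentation model rather than using this exact decoding.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first decoding, WordPiece-style: repeatedly
    take the longest vocabulary piece that matches at the current offset.
    Non-initial pieces carry a '##' continuation prefix."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:        # no vocabulary piece covers this character
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary: frequent pieces cover both the canonical
# spelling and a leetspeak-style misspelling without a dedicated entry.
TOY_VOCAB = {"warri", "##or", "##0", "##r", "key", "##board"}
```

Here `subword_tokenize("warrior", TOY_VOCAB)` yields `["warri", "##or"]`, while the misspelling `"warri0r"` still decomposes into known pieces `["warri", "##0", "##r"]` instead of falling back to an out-of-vocabulary token.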

Pretraining tasks
Similar to BERT, we pre-train our HABERTOR with two unsupervised learning/language modeling tasks: (i) masked token prediction, and (ii) next sentence prediction. We describe some modifications that we made to the original BERT's implementation as follows:

Masked token prediction task
BERT generates only one masked training instance for each input sequence. Instead, inspired by Liu et al. (2019), we generate τ training instances by randomly sampling with replacement masked positions τ times. We refer to τ as a masking factor. Intuitively, this helps the model to learn differently combined patterns of tokens in the same input sequence, and boosts the model's language understanding. This small modification works especially well when we have a smaller pre-training data size, which is often true for a domain-specific task (e.g., hatespeech detection).
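The masking-factor idea can be sketched as follows. This is a simplified, hypothetical helper: the 15% mask ratio follows BERT's default, and BERT's 80/10/10 token-replacement details are omitted.

```python
import random

def masked_instances(tokens, tau, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Generate tau corrupted copies of one input sequence, re-sampling the
    masked positions for each copy, so the same sequence yields several
    differently masked training instances."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    instances = []
    for _ in range(tau):
        positions = sorted(rng.sample(range(len(tokens)), n_mask))
        corrupted = list(tokens)
        for p in positions:
            corrupted[p] = mask_token
        instances.append((corrupted, positions))
    return instances
```

With tau = 3, one sequence contributes three training instances whose masked positions generally differ, exposing the model to different token combinations of the same input.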

Next sentence prediction task
In BERT, the two input sentences are already paired and prepared in advance. In our case, we have to preprocess input text sequences to prepare paired sentences for the next sentence prediction task. We conduct the following preprocessing steps:
Step 1: We train an unsupervised sentence tokenizer from the nltk library. Then we use the trained sentence tokenizer to tokenize each input text sequence into (split) sentences.
Step 2: In BERT, with 50% probability two consecutive sentences are paired as a next example, and with 50% probability two non-consecutive sentences are paired as a not next example. In our case, our input text sequences can be broken into one, two, three, or more sentences. For input text sequences that consist of only one tokenized sentence, the only choice is to pair with another random sentence to generate a not next example. By following our 50-50 rule described in the Appendix, we ensure generating an equal number of next and not next examples.
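The pairing logic above can be sketched as follows. This is a simplified, hypothetical helper: multi-sentence inputs yield a next pair with probability `p_next`, single-sentence inputs can only yield not next pairs, and a real implementation would also avoid sampling the true next sentence as the "random" partner.

```python
import random

def make_sentence_pairs(inputs, p_next, seed=0):
    """inputs: list of tokenized inputs, each a list of sentences.
    Returns (sentence_a, sentence_b, label) triples, label 1 = 'next'."""
    rng = random.Random(seed)
    pool = [s for doc in inputs for s in doc]   # pool for random pairing
    pairs = []
    for doc in inputs:
        if len(doc) >= 2 and rng.random() < p_next:
            pairs.append((doc[0], doc[1], 1))            # consecutive: next
        else:
            pairs.append((doc[0], rng.choice(pool), 0))  # random: not next
    return pairs
```

With `p_next = 1.0`, every multi-sentence input produces a next pair, while single-sentence inputs always produce not next pairs; the 50-50 rule in the Appendix chooses `p_next` so the two counts balance.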

Training the hatespeech prediction task
For the hatespeech prediction task, we propose a multi-source multi-head HABERTOR classifier. The architecture comparison between traditional fine-tuning of BERT and our proposal is shown in Figure 1. We note two main differences in our design as follows.
First, as shown in Figure 1b, our HABERTOR has separate classification heads/nets for input sequences from different sources, but with shared language understanding knowledge. Intuitively, instead of measuring the same probability P(y|s) for all input sequences, it injects additional prior source knowledge of the input sequences to measure P(y|s, "news") or P(y|s, "finance").
Second, in addition to multiple sources, HABERTOR with an ensemble of h heads provides even more capability to model data variance. For each input source, we employ an ensemble of several classification heads (i.e., two classification heads per source in Figure 1b) and use a pooling layer on top to aggregate the results from those heads. We use three pooling functions: min, max, and mean. min pooling indicates that HABERTOR classifies an input comment as hateful only if all of the heads classify it as hatespeech, which puts a more stringent requirement on classifying hatespeech. Equivalently, HABERTOR will predict an input comment as a normal comment if at least one of the heads recognizes it as normal, which is less strict. Similarly, max pooling puts more restriction on declaring comments as normal, and less restriction on declaring hatespeech. Finally, mean pooling considers the average vote of all heads.
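The head-pooling behavior can be sketched as a small hypothetical helper operating on the per-head probabilities P(y = 1 | s, source):

```python
def pool_heads(head_probs, pooling="mean"):
    """Aggregate hatespeech probabilities from an ensemble of heads.
    'min': hateful only if every head says so (stricter on hatespeech);
    'max': normal only if every head says so (stricter on normal);
    'mean': average vote of the heads."""
    if pooling == "min":
        return min(head_probs)
    if pooling == "max":
        return max(head_probs)
    if pooling == "mean":
        return sum(head_probs) / len(head_probs)
    raise ValueError(f"unknown pooling function: {pooling}")
```

For example, with head probabilities [0.9, 0.4] and a 0.5 decision threshold, min pooling yields 0.4 (normal), max pooling yields 0.9 (hateful), and mean pooling yields 0.65.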
Note that our design generalizes the traditional fine-tuning BERT architecture, which corresponds to h=1 with the two classification nets sharing the same weights. Thus, HABERTOR is more flexible than conventional fine-tuning of BERT. HABERTOR can also be extended trivially to problems with q sources, using h separate classification heads for each of the q sources. When predicting input sequences from new sources, HABERTOR averages the scores from all separate classification nets.

Parameter Estimation
Estimating parameters in the pretraining tasks in our model is similar to BERT, and we leave the details in the Appendix due to space limitation.
For the hatespeech prediction task, we use the transformed embedding vector of the [CLS] token as a summarized embedding vector for the whole input sequence. Let S be a collection of sequences s_i. Note that s_i is a normal sequence, not corrupted or concatenated with another input sequence. Let y_i be the supervised ground truth label for the input sequence s_i, and let ŷ_i = P(y_i | s_i, "news") (Figure 1b) when s_i is a news input sequence, or ŷ_i = P(y_i | s_i, "finance") when s_i is a finance input sequence. The hatespeech prediction task aims to minimize the following binary cross entropy loss:

L_hate = − Σ_{s_i ∈ S} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ]

Regularize with adversarial training: To make our model more robust to perturbations of the input embeddings, we further regularize our model with adversarial training. There exist several state-of-the-art target-based adversarial attacks, such as the Fast Gradient Method (FGM) (Miyato et al., 2017), the Basic Iterative Method (Kurakin et al., 2016), and the Carlini L2 attack (Carlini and Wagner, 2017). We use the FGM method as it is effective and efficient according to our experiments.
In FGM, the noise magnitude is a scalar value and is a manually set input hyper-parameter. This is suboptimal: adversarial directions along different dimensions are scaled identically and, moreover, manually tuning the noise magnitude is expensive and not optimal. Hence, we propose to extend FGM with a learnable and fine-grained noise magnitude, where the noise magnitude is parameterized by a learnable vector, providing different scales for different adversarial dimensions. Moreover, the running time of our proposal is similar to that of FGM.
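The perturbation direction can be sketched with NumPy as follows, under the standard FGM linearization. Here `grad` stands for the gradient of the adversarial loss with respect to a token embedding, and `eps` is either the classic scalar magnitude or, as proposed here, a learnable per-dimension vector; how `eps` itself would be updated jointly with the model is omitted from this sketch.

```python
import numpy as np

def fgm_noise(grad, eps):
    """Scale the normalized gradient direction by eps. A scalar eps
    reproduces classic FGM; a vector eps scales each embedding
    dimension separately (fine-grained noise magnitude)."""
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return np.zeros_like(grad)
    return eps * grad / norm   # broadcasts when eps is a vector
```

For example, for grad = [3, 4], classic FGM with eps = 1 gives the perturbation [0.6, 0.8], while a learnable vector eps = [1, 2] stretches the second dimension to [0.6, 1.6].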
The basic idea of adversarial training is to add a small perturbation noise δ to each of the token embeddings such that the model misclassifies hateful comments as normal comments, and vice versa. Given an input sequence s_i with ground truth label y_i, let ỹ_i be the adversarial target class of s_i such that ỹ_i ≠ y_i. In the hatespeech detection domain, our model is a binary classifier; hence, when y_i = 1 (s_i is a hateful comment), ỹ_i = 0, and vice versa. The perturbation noise δ_i is then learned by minimizing the following cost function:

δ_i = argmin_{δ: ||δ||_2 ≤ ε} − log P(ỹ_i | s_i + δ; θ)    (1)

Note that in the traditional FGM method, δ in Eq. (1) is constrained by a predefined scalar noise magnitude ε, whereas in our proposal ε is a learnable fine-grained vector. Solving Eq. (1) exactly is expensive and not easy, especially with complicated deep neural networks. Thus, we approximate each perturbation noise δ_i for each input sequence s_i by linearizing the partial loss − log P(ỹ_i | s_i + δ_i; θ) around s_i. Particularly, δ_i is measured by:

δ_i = ε ⊙ g_i / ||g_i||_2, where g_i = ∇_{s_i} (− log P(ỹ_i | s_i; θ))    (2)

In Eq. (2), ε is a learnable vector with the same dimensionality as δ_i, whose values are constrained to a predefined range. Finally, HABERTOR aims to minimize the following cost function:

L = L_hate + Σ_{s_i ∈ S} − log P(y_i | s_i + δ_i; θ) − λ ||ε||_2    (3)

where L_hate is the binary cross entropy hatespeech classification loss, − λ ||ε||_2 is an additional term to force the model to learn to be as robust as possible, and λ is a hyper-parameter to balance its effect. Note that we first learn the optimal values of all token embeddings and HABERTOR's parameters before learning the adversarial noise δ. Also, regularized adversarial training only increases the training time, not the inferencing time, since it does not introduce extra parameters for the model during inference.

Datasets: Besides our Yahoo dataset, we evaluate on two public datasets: Twitter (Waseem and Hovy, 2016) and Wiki (Wulczyn et al., 2017). The Twitter dataset consists of 16K annotated tweets, including 5,054 hateful tweets (i.e., 31%). The Wiki dataset has 115K labeled discussion comments from English Wikipedia talk pages, including 13,590 hateful comments (i.e., 12%). The statistics of the 3 datasets are shown in Table 1.

Train/Dev/Test split: We split the dataset into train/dev/test sets with a ratio of 70%/10%/20%.
We tune hyper-parameters on the dev set and report final results on the test set. Considering the critical mistakes reported by Arango et al. (2019) when building machine learning models (e.g., extracting features using the entire dataset, including testing data), we generate vocabs, pre-train the two language modeling tasks, and train the hatespeech prediction task using only the training set.

All of our models are much smaller than BERT-base (i.e., 110M parameters). The configuration comparison of HABERTOR-VAFOQF and other pretrained language models is given in Table 2. HABERTOR-VAFOQF has less than half of TinyBERT's parameter count, less than one ninth of DistilBERT's, and 0.59 times ALBERT's size.

Table 3 shows the performance of all models on the Yahoo dataset. Note that we train on the Yahoo training set that contains both Yahoo News and Finance data, report results on Yahoo News and Finance separately, and report only AUC and AP on their union (denoted as column "Yahoo" in Table 3). We see that Fermi worked worst among all models, mainly because Fermi transfers the pre-trained embeddings from the USE model to an SVM classifier without further fine-tuning, which limits its ability to understand domain-specific contexts. Q-Transformer works the best among the non-LM baselines, but worse than the LM baselines as it is not pretrained. BERT-base performed the best among all baselines. Also, the distilled models worked worse than BERT-base due to their compression of BERT-base as the teacher model. Next, we compare the performance of our proposed models against each other. Table 3 shows that our models' performance decreases as we compress more components (p-value < 0.05 under the directional Wilcoxon signed-rank test). We reason that this is a trade-off between model size and model performance, as factorizing a component naturally loses some of its information.

Performance comparison
Then, we compare our proposed models with BERT-base, the best baseline. Table 3 shows that except for HABERTOR-VAFOQF, our proposals outperformed BERT-base, improving the F1-score by an average of 1.2% and 1.5% in Yahoo News and Yahoo Finance, respectively (p-value < 0.05). Recall that in addition to improving hatespeech detection performance, our models are much smaller than BERT-base. For example, HABERTOR saves 84M parameters over BERT-base, and HABERTOR-VAFQF saves nearly 100M parameters. Interestingly, even our smallest HABERTOR-VAFOQF model (7.1M parameters) achieves similar results to BERT-base (i.e., the performance difference between them is not significant under the directional Wilcoxon signed-rank test). These results show the effectiveness of our proposed models against BERT-base, the best baseline, and confirm the need for pretraining a language model on a hateful corpus for better hateful language understanding.

Running time and memory comparison
Running time: Among LM baselines, TinyBERT is the fastest. Though ALBERT has the smallest number of parameters by adopting the cross-layer weight sharing mechanism, ALBERT has the same number of layers as BERT-base, leading to a similar computational expense as BERT-base.
Our HABERTOR-VQF and HABERTOR-VAQF have a very similar parameter size to TinyBERT, and their training/inference times are similar. Interestingly, even though HABERTOR has 26M parameters, its runtime is also competitive with TinyBERT. This is because 15.4M of HABERTOR's 26M parameters are for encoding 40k vocabs; these are not computational parameters and are only updated sparsely during training. HABERTOR-VAFQF and HABERTOR-VAFOQF significantly reduce the number of parameters compared to TinyBERT, leading to a speedup during the training and inference phases. In particular, our experiments on 4 K80 GPUs with a batch size of 128 show that HABERTOR-VAFOQF is 1.6 times faster than TinyBERT.

Memory consumption: Our experiments with a batch size of 128 on 4 K80 GPUs show that among the LM baselines, TinyBERT and ALBERT are the most lightweight models, consuming 13GB of GPU memory. Compared to TinyBERT and ALBERT, HABERTOR takes an additional 4GB of GPU memory, while HABERTOR-VQF and HABERTOR-VAQF have similar memory consumption, and HABERTOR-VAFQF and HABERTOR-VAFOQF reduce it by 1∼3 GB.

Compared to BERT-base: In general, HABERTOR is 4∼5 times faster and uses 3.1 times less GPU memory than BERT-base. Our most lightweight model, HABERTOR-VAFOQF, even uses 3.6 times less GPU memory while remaining as effective as BERT-base. The memory saving in our models also indicates that we could increase the batch size to perform inference even faster.

Generalizability analysis
We perform hatespeech language model transfer learning on the hateful Twitter and Wiki datasets to understand our models' generalizability. We use our models' pre-trained language model checkpoints learned from the Yahoo hateful dataset, and fine-tune them on the Twitter/Wiki datasets. Note that the fine-tuning also includes regularized adversarial training for the best performance. Next, we compare the performance of our models with Fermi and the four LM baselines, the best baselines reported in Table 3. Table 4 shows that BERT-base performed best among the fine-tuned LMs, which is consistent with our reported results on the Yahoo datasets in Table 3. When comparing with BERT-base's performance (i.e., the best baseline) on the Twitter dataset, all our models outperformed BERT-base. On the Wiki dataset, interestingly, our models work very competitively with BERT-base and achieve similar F1-score results. Recall that BERT-base has the major advantage of pre-training on 2,500M Wiki words, and thus potentially understands Wiki language styles and contexts better. In contrast, HABERTOR and its four factorized versions are pre-trained on 33M words from the Yahoo hatespeech dataset. As shown in the ablation study (refer to AS2 in Section A.6 of the Appendix), a larger pre-training data size leads to better language understanding and higher hatespeech prediction performance. Hence, if we acquire larger pre-training data with more hateful examples, our models' performance can be further boosted. All of these results show that our models generalize well to other hatespeech datasets compared with BERT-base, with a significant reduction in model complexity.

Ablation study
Effectiveness of the FGM attacking method with our fine-grained and adaptive noise magnitude: To show the effectiveness of the FGM attacking method with our proposed fine-grained and adaptive noise magnitude, we compare the performance of HABERTOR and its four factorized versions when (i) using a fixed scalar noise magnitude as in the traditional FGM method, and (ii) using the fine-grained and adaptive noise magnitude in our proposal. We evaluate the results by performing the language model transfer learning on the Twitter and Wiki datasets and present results in Table 5. Note that the noise magnitude range is set in [1, 5] in both cases (i) and (ii) for a fair comparison, and we manually search the optimal value of the noise magnitude in the traditional FGM method using the development set of each dataset. We observe that for all five of our models, learning with our modified FGM produces better results than learning with the traditional FGM, confirming the effectiveness of our proposed fine-grained and adaptive noise magnitude.
We also plot the histogram of the learned noise magnitudes of HABERTOR on the Twitter and Wiki datasets. Figure 2 shows that different embedding dimensions are assigned different learned noise magnitudes, showing the need for our proposed fine-grained and adaptive noise magnitude, which automatically assigns different noise scales to different embedding dimensions.

Additional ablation studies: We conduct several ablation studies to understand HABERTOR's sensitivity. Due to space limitations, we summarize the key findings as follows and leave detailed information and additional study results to the Appendix: (i) a large masking factor in HABERTOR is helpful to improve its performance; (ii) pretraining with a larger hatespeech dataset, or more fine-grained pretraining, can improve the hatespeech prediction performance; and (iii) our fine-tuning architecture with multiple sources and an ensemble of classification heads helps improve the performance.

Further application discussion
Our proposals were designed for the hatespeech detection task, but to an extent they can be applied to other text classification tasks. To illustrate the point, we evaluate our models (i.e., all our pretraining and fine-tuning designs) on a sentiment classification task. Particularly, we used 471k Amazon-Prime-Pantry reviews (McAuley et al., 2015), selected because of its reasonable size for fast pretraining, fine-tuning, and result attainment. After some preprocessing (i.e., removing duplicated reviews, labeling reviews with rating scores ≥ 4 as positive and ≤ 2 as negative, and dropping the neutral class for easy illustration), we obtained 301k reviews and split them into 210k-training/30k-development/60k-testing sets with a ratio of 70/10/20. Next, we pretrained our models on the 210k training reviews, which contain 5.06M words. Then, we fine-tuned our models on the 210k training reviews, selected a classification threshold on the 30k development reviews, and report AUC, AP, and F1 on the 60k testing reviews. We compare the performance of our models with fine-tuned BERT-base and ALBERT-base, the two best baselines. We observe that although it is pretrained on only the 5.06M words of the 210k training reviews, HABERTOR performs very similarly to BERT-base, while improving over ALBERT-base. Except for HABERTOR-VAFOQF, whose F1-score is slightly smaller than ALBERT-base's, our other three compressed models worked better than ALBERT-base, showing the effectiveness of our proposals.

Conclusion
In this paper, we presented the HABERTOR model for detecting hatespeech. HABERTOR understands the language of the hatespeech datasets better, is 4-5 times faster than BERT-base, uses less than 1/3 of the memory, and has better performance in hatespeech classification. Overall, HABERTOR outperforms 15 state-of-the-art hatespeech classifiers and generalizes well to unseen hatespeech datasets, verifying not only its efficiency but also its effectiveness.

A.1 Parameter Estimation in the Pretraining Tasks

Given an input sequence c_l = [w^u_1, ..., w^u_u, w^v_1, ..., w^v_v] = [w_1, ..., w_n] (n = u + v) with label y_l, where we have already paired the two sentences to generate a next (i.e., y_l = 1) or not next (i.e., y_l = 0) training instance, let c̃_l be a corrupted sequence of c_l in which we masked some tokens. Denote by C a collection of such training text sequences c_l. The masked token prediction task aims to reconstruct each c_l ∈ C given the corrupted sequence c̃_l. In other words, the masked token prediction task minimizes the following negative log-likelihood:

L_1 = − Σ_{c_l ∈ C} Σ_{t=1..n} 1_t log P(w_t | c̃_l; θ)

where 1_t is an indicator function with 1_t = 1 when the token at position t is a [MASK] token and 1_t = 0 otherwise, θ refers to all the model's learnable parameters, and w_t is the ground truth token at position t. Denote by H_θ(c_l) = [H_θ(c_l)_1, H_θ(c_l)_2, ..., H_θ(c_l)_n] the sequence of transformed output embedding vectors obtained at the final layer for the n tokens in the sequence c_l, with H_θ(c_l)_t ∈ R^d, where d is the embedding size. By parameterizing a linear layer with a transformation W_1 ∈ R^{V×d} (where V refers to the vocabulary size) as a decoder, we can rewrite L_1 as:

L_1 = − Σ_{c_l ∈ C} Σ_{t=1..n} 1_t log [softmax(W_1 H_θ(c̃_l)_t)]_{w_t}

where [·]_{w_t} refers to the output value at the position of token w_t.
For the next sentence prediction task, the objective is to minimize the following binary cross entropy loss function:

L_2 = − Σ_{c_l ∈ C} [ y_l log σ(W_2^T H_θ(c_l)_1) + (1 − y_l) log(1 − σ(W_2^T H_θ(c_l)_1)) ]

where σ is the sigmoid function, W_2 ∈ R^d, and H_θ(c_l)_1 refers to the embedding vector of the first token in the sequence c_l, i.e., the [CLS] token. The intuition behind this is that the [CLS] embedding vector summarizes the information of all other tokens via the attention Transformer network (Vaswani et al., 2017).
Then, pretraining with the two language modeling tasks aims to minimize both loss functions L_1 and L_2:

L_LM = argmin_θ (L_1 + L_2)

A.2 Quaternion
In mathematics, Quaternions are a hypercomplex number system. A Quaternion number P in a Quaternion space H is formed by a real component (r) and three imaginary components as follows:

P = r + a i + b j + c k    (4)

where ijk = i^2 = j^2 = k^2 = −1. The non-commutative multiplication rules of quaternion numbers are: ij = k, jk = i, ki = j, ji = −k, kj = −i, ik = −j. In Eq. (4), r, a, b, c are real numbers ∈ R. Note that r, a, b, c can also be extended to real-valued vectors to obtain a Quaternion embedding, which we use to represent each word-piece embedding.

Algebra on Quaternions: We present the Hamilton product on Quaternions, which is the heart of the linear Quaternion-based transformation. The Hamilton product (denoted by the ⊗ symbol) of two Quaternions P = r_P + a_P i + b_P j + c_P k ∈ H and Q = r_Q + a_Q i + b_Q j + c_Q k ∈ H is defined as:

P ⊗ Q = (r_P r_Q − a_P a_Q − b_P b_Q − c_P c_Q) + (r_P a_Q + a_P r_Q + b_P c_Q − c_P b_Q) i + (r_P b_Q − a_P c_Q + b_P r_Q + c_P a_Q) j + (r_P c_Q + a_P b_Q − b_P a_Q + c_P r_Q) k    (5)

Activation function on Quaternions: Similar to (Tay et al., 2019; Parcollet et al., 2019), we use a split activation function because of its stability and simplicity. The split activation function β on a Quaternion P is defined as: β(P) = f(r) + f(a) i + f(b) j + f(c) k, where f is any standard activation function for Euclidean-based values.

Why does a linear Quaternion transformation reduce 75% of parameters compared to the linear Euclidean transformation? Figure 3 shows a comparison between a traditional linear Euclidean transformation and a linear Quaternion-based transformation.
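The Hamilton product above can be written out directly as a small sketch, with quaternions represented as (r, a, b, c) coefficient tuples:

```python
def hamilton_product(p, q):
    """Hamilton product of quaternions p = (r, a, b, c) and q = (r, a, b, c):
    the real part first, then the i, j, k coefficients."""
    r1, a1, b1, c1 = p
    r2, a2, b2, c2 = q
    return (r1*r2 - a1*a2 - b1*b2 - c1*c2,   # real
            r1*a2 + a1*r2 + b1*c2 - c1*b2,   # i
            r1*b2 - a1*c2 + b1*r2 + c1*a2,   # j
            r1*c2 + a1*b2 - b1*a2 + c1*r2)   # k

# The non-commutative rules fall out of the product: i*j = k but j*i = -k.
I, J, K = (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
```

Note the weight sharing this product implies: a single weight quaternion (4 parameters) transforms a 4-component input via this product, versus the 16 parameters of an unconstrained 4×4 real-valued linear map, which is the 75% reduction discussed above.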
In Euclidean space, the same input is multiplied with different weights to produce different output dimensions. Particularly, given a real-valued 4-dimensional vector [r_in, a_in, b_in, c_in], we need to parameterize a weight matrix of 16 parameters (i.e., 16 degrees of freedom) to transform the 4-dimensional input vector into a 4-dimensional output vector [r_out, a_out, b_out, c_out]. However, with the Quaternion transformation, the input vector is represented with 4 components, where r_in is the value of the real component, and a_in, b_in, c_in are the corresponding values of the three imaginary parts i, j, k, respectively. Because of the weight-sharing nature of the Hamilton product, different output dimensions take different combinations of the same input with exactly the same 4 weight parameters {r_w, a_w, b_w, c_w}. Thus, the Quaternion transformation reduces 75% of the number of parameters compared to the real-valued representation in Euclidean space.

Quaternion-Euclidean conversion: Another excellent property of Quaternion representations and Quaternion transformations is that converting from Quaternion to Euclidean and vice versa is convenient. To convert a real-valued vector v ∈ R^d into a Quaternion-based vector, we consider the first d/4 dimensions of v as the values of the real component, and the corresponding next three d/4-dimensional chunks as the values of the three imaginary parts, respectively. Similarly, to convert a Quaternion vector v ∈ H^d into a real-valued vector, we simply concatenate all four components of the Quaternion vector and treat the concatenated vector as a real-valued vector in Euclidean space.

A.3 Analysis on the BERT's Parameters

Figure 4 presents a general view of the BERT architecture. Each BERT layer contains three parts: (i) attention, (ii) filtering, and (iii) output. The attention part parameterizes three H×H weight transformation matrices to form the key, query, and value from the input, and another H×H weight matrix to transform the attention output; in total, this part has 4H^2 parameters. The filtering part parameterizes an H×4H weight matrix to transform the output of the attention part, leading to 4H^2 parameters. The output part parameterizes a 4H×H weight matrix to transform the output of the filtering part from 4H dimensions back to H, resulting in another 4H^2 parameters.

Thus, a BERT layer has 12H^2 parameters, and a BERT-base setting with 12 layers has 144H^2 parameters. Taking into account the parameters for encoding V vocabulary words, the total number of parameters of BERT is VH + 144H^2.
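As a quick numerical check of the VH + 144H^2 count, the formula can be evaluated directly. The sizes below are illustrative (H = 768 and V = 30,522 roughly match BERT-base), and the count covers only the embedding and weight matrices, ignoring biases, LayerNorm, and position/segment embeddings:

```python
def bert_transformer_params(H: int, layers: int = 12) -> int:
    """Weight-matrix parameters of the transformer stack: 12*H^2 per layer."""
    attention = 4 * H * H   # Q, K, V, and output projections (H x H each)
    filtering = 4 * H * H   # H x 4H feed-forward expansion
    output    = 4 * H * H   # 4H x H feed-forward projection
    return layers * (attention + filtering + output)

def bert_total_params(V: int, H: int, layers: int = 12) -> int:
    """Token-embedding parameters (V*H) plus the transformer stack."""
    return V * H + bert_transformer_params(H, layers)

H, V = 768, 30522                   # illustrative BERT-base-like sizes
stack = bert_transformer_params(H)  # 144 * H^2 = 84,934,656
total = bert_total_params(V, H)     # V*H + 144*H^2 = 108,375,552
```

The result (about 108M) is close to BERT-base's commonly cited ~110M parameters; the gap comes from the terms the formula deliberately omits.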

A.4 50-50 Rule
To ensure the 50-50 rule, we perform the following method: Let M be the number of input text sequences that we can split into multiple sentences, and N be the number of input sequences that can be tokenized into only one sentence. We want the number of sentence pairs generated as next sentence examples (sampled with probability p_1 from the M splittable sequences) to be roughly equal to the number of sentence pairs formed as not next sentence examples (sampled with probability p_2, plus the N single-sentence sequences, which can only form not next pairs). In other words, M × p_1 = M × p_2 + N. Since p_1 + p_2 = 1, replacing p_2 = 1 − p_1, we have: p_1 = (M + N) / (2M). With p_1 established, we set p_1 as the probability for a sentence to be paired with its consecutive sentence in the same input sequence to generate a next sentence example.

A.5 Implementation Details

Following Devlin et al. (2019), we use the pretrained BERT with 12 layers and the uncased vocabulary (our experiments show uncased works better than cased vocab) to perform fine-tuning for the hatespeech detection.
For baselines that require word embeddings, to maximize their performance, we initialize the embeddings both with pre-trained GloVe vectors (Pennington et al., 2014) and randomly, and report the best results. We implement BOW and NGRAM with Naive Bayes, Random Forest, Logistic Regression, and Xgboost classifiers, and then report the best results.
By default, our vocab size is set to 40k. The number of pretraining epochs is set to 60, and the batch size is set to 768. The learning rate is set to 5e-5 for the masked token prediction and next sentence prediction tasks (the two pretraining tasks), and 2e-5 for the hatespeech prediction task (the fine-tuning task). The default design of HABERTOR is given in Figure 1b, with one separate classification net with an ensemble of 2 heads for each input source. The masking factor τ is set to 10. The noise magnitude's bound constraint is [a, b] = [1, 2] in the Yahoo dataset, and [a, b] = [1, 5] in the Twitter and Wiki datasets. λ_adv = 1.0 and λ = 1 in all three datasets. We use the min pooling function to put a more stringent requirement on classifying hatespeech comments, as hatespeech-labeled comments are the minority. All the pre-trained language models are fine-tuned with the Yahoo train set. For all other baselines, we vary the hidden size over {96, 192, 384} and report their best results. We build VDCNN with one initial convolution layer and 4 convolutional blocks with 64, 128, 256, and 512 filters, respectively, each with a kernel size of 3; each convolutional block includes two convolution layers. For FastText, we find that 1,2,3-grams and 1,2,3,4,5-character grams give the best performance. All models are optimized using the Adam optimizer (Kingma and Ba, 2014).
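The effect of min pooling over the ensemble heads can be seen in a small numpy sketch (the scores and the 0.5 threshold here are illustrative, not taken from our experiments): each head emits a hatespeech probability, min pooling keeps the lowest one, so a comment is flagged only when every head is confident.

```python
import numpy as np

def min_pool_heads(head_scores: np.ndarray) -> np.ndarray:
    """head_scores: (num_heads, batch) hatespeech probabilities per head.
    Min pooling keeps the most conservative score per comment."""
    return head_scores.min(axis=0)

scores = np.array([[0.9, 0.4, 0.8],    # head 1
                   [0.7, 0.9, 0.2]])   # head 2
pooled = min_pool_heads(scores)        # [0.7, 0.4, 0.2]
flagged = pooled >= 0.5                # only the first comment is flagged
```

The third comment shows the stringency: one head scores it 0.8, but the other head's 0.2 vetoes it.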

A.6 Ablation Study
Effectiveness of regularized adversarial training and masking factor τ (AS1): Recall that by default, HABERTOR has 2 classification nets, each with an ensemble of 2 classification heads and masking factor τ = 10, and is trained with regularized adversarial training. HABERTOR -adv denotes HABERTOR without regularized adversarial training, and HABERTOR -adv + τ = 1 denotes HABERTOR without regularized adversarial training and with a masking factor τ of 1 instead of 10. Comparing HABERTOR with HABERTOR -adv, we see a drop of AP by 1.16%, F1-score by 1.16%, and an increase of the average error rate by 0.78% (i.e., the average of FPR@5%FNR and FNR@5%FPR). This shows the effectiveness of the additional regularized adversarial training in making HABERTOR more robust. Furthermore, comparing HABERTOR -adv (with default τ = 10) with HABERTOR -adv + τ = 1, we observe a drop of AP by 0.92%, F1-score by 0.24%, and an increase of the average error rate by 1.01%. This shows the need for both regularized adversarial training with our proposed fine-grained and adaptive noise magnitude, and a large masking factor in HABERTOR.
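For intuition, the gradient-based noise used in adversarial training can be sketched as follows. This is a generic FGM-style perturbation (in the spirit of Miyato et al.), not our exact fine-grained formulation; the per-example magnitude eps and its clipping to the bound [a, b] = [1, 2] (the Yahoo setting) are illustrative stand-ins, and in practice `grad` would come from backpropagation:

```python
import numpy as np

def adversarial_perturbation(grad, eps, a=1.0, b=2.0):
    """Scale the loss gradient w.r.t. the input embeddings into a noise
    vector of magnitude eps, with eps clipped to the bound [a, b]."""
    eps = np.clip(eps, a, b)                      # adaptive magnitude, bounded
    norm = np.linalg.norm(grad) + 1e-12           # avoid division by zero
    return eps * grad / norm                      # noise in gradient direction

grad = np.array([3.0, 4.0])                       # toy embedding gradient
delta = adversarial_perturbation(grad, eps=5.0)   # eps clipped down to 2.0
# perturbed input = embedding + delta
```

The bound constraint keeps the adversarial example close enough to the original input that its label plausibly stays the same.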
Is pretraining with a larger domain-specific dataset helpful? (AS2): We answer the question by answering a reverse question: does pretraining with smaller data reduce performance? We pre-train HABERTOR with 250k Yahoo comments (4 times smaller) and 500k Yahoo comments (2 times smaller). Then, we compare the results of HABERTOR -adv + τ = 1 with HABERTOR -adv + τ = 1 under 250k data, and HABERTOR -adv + τ = 1 under 500k data. Table 7 shows the results. We observe that pretraining with a larger data size increases the hatespeech prediction performance. We see a smaller drop when moving from 1M to 500k pretraining data (AP drops 0.6%), and a bigger drop when moving from 500k to 250k (AP drops 4.4%). We reason that when the pretraining data size is too small, important linguistic patterns that may exist in the test set are not fully observed in the training set. In short, pretraining with larger hatespeech data can improve the hatespeech prediction performance. Note that BERT-base is pre-trained on 3,300M words, 106 times more than HABERTOR (only 31M words). Hence, the performance of HABERTOR could be boosted further by pre-training the hatespeech language model on a larger number of hateful representatives.
Usefulness of separated source prediction and ensemble heads (AS3): We compare HABERTOR under its default settings with: single source + single head (i.e., one classification net for all data sources, see Figure 1a), multi-source + single head (i.e., each source has a single classification head, see Figure 1b), and multi-source + more ensemble classification heads (see Figure 1b). Table 7 shows that the overall performance order is multi-source + ensemble of 2 heads > multi-source + single head > single source + single head, indicating the usefulness of our multi-source and ensemble-of-classification-heads architecture in the fine-tuning phase. However, when the number of ensemble heads is ≥ 4, we do not observe better performance.
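The multi-source design above can be made concrete with a minimal sketch (the source names, random weights, and two-head nets here are illustrative, not the exact implementation): each input source gets its own classification net, the heads within a net are min-pooled, and a comment is scored by the net matching its source.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_head(dim):
    """One linear classification head (weights random, for illustration)."""
    w, bias = rng.normal(size=dim), rng.normal()
    return lambda x: 1.0 / (1.0 + np.exp(-(x @ w + bias)))  # sigmoid score

def make_source_net(dim, n_heads=2):
    """Ensemble of heads for one source, min-pooled into one score."""
    heads = [make_head(dim) for _ in range(n_heads)]
    return lambda x: min(h(x) for h in heads)

dim = 8
nets = {"source_a": make_source_net(dim),   # one net per input source
        "source_b": make_source_net(dim)}

x = rng.normal(size=dim)      # pooled sentence representation of a comment
score = nets["source_a"](x)   # routed to the net of the comment's source
```

Routing lets each net specialize in the writing style of its source, while the shared encoder underneath is still trained on all sources.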
Is pretraining the two language modeling tasks helpful for the hatespeech detection task? (AS4): We compare HABERTOR -adv + τ = 1 with HABERTOR -adv + τ = 1 -pretraining, where we ignore the pretraining step and treat HABERTOR as an attentive network for pure supervised learning with random parameter initialization. In Table 7, the performance of HABERTOR without language model pretraining is substantially degraded: AUC drops ∼2%, AP drops ∼5%, FPR and FNR errors are ∼9% and ∼5% higher, respectively, and F1 drops 4%. These results show a significant impact of the pretraining tasks on hatespeech detection.
Is HABERTOR sensitive when varying its number of layers, attention heads, and embedding size? (AS5): In Table 7, we observe that HABERTOR+3 layers and HABERTOR+4 layers work worse than HABERTOR (6 layers), indicating that a deeper model does help to improve hatespeech detection. However, when we increase the number of attention heads from 6 to 12, or decrease it from 6 to 4, the performance becomes worse. We reason that with 12 attention heads, since there is no mechanism to constrain different attention heads to attend to different information, they may end up focusing on similar things, as shown in (Clark et al., 2019). But with only 4 attention heads, the model is not complex enough to attend to more relevant information, leading to worse performance. Similarly, when we reduce the embedding size from 384 to 192, the performance is worse. Note that we could not perform experiments with larger embedding sizes and/or more layers due to high running time and memory consumption. However, Table 7 shows that even the smaller HABERTOR variants with 3 layers, 4 layers, or a 192 hidden size still obtain slightly better results than the BERT-base results reported in Table 3. This again indicates the need for pretraining language models on a hatespeech-related corpus for the hatespeech detection task.
Effectiveness of fine-grained pretraining (AS6): Since the pretraining phase is unsupervised, a natural question is how much pretraining we should perform to get good hatespeech prediction performance, i.e., how many pretraining epochs are enough? To answer this, we vary the number of pretraining epochs in {10, 20, 30, ..., 60} before performing the fine-tuning phase with the hatespeech classification task. We report the changes in AUC and AP of fine-tuned HABERTOR on the Yahoo dataset without regularized adversarial training in Figure 5. We observe that more pretraining helps to increase the hatespeech prediction results, which is similar to a recent finding in Liu et al. (2019), especially from 10 to 40 epochs. However, after 40 epochs, the improvement is smaller.