Contextualizing Hate Speech Classifiers with Post-hoc Explanation

Hate speech classifiers trained on imbalanced datasets struggle to determine if group identifiers like “gay” or “black” are used in offensive or prejudiced ways. Such biases manifest in false positives when these identifiers are present, due to models’ inability to learn the contexts which constitute a hateful usage of identifiers. We extract post-hoc explanations from fine-tuned BERT classifiers to detect bias towards identity terms. Then, we propose a novel regularization technique based on these explanations that encourages models to learn from the context of group identifiers in addition to the identifiers themselves. Our approach improved over baselines in limiting false positives on out-of-domain data while maintaining, and in some cases improving, in-domain performance.


Introduction
Hate speech detection is part of the ongoing effort to limit the harm done by oppressive and abusive language (Waldron, 2012; Gelber and McNamara, 2016; Gagliardone et al., 2015; Mohan et al., 2017). Performance has improved with access to more data and more sophisticated algorithms (e.g., Mondal et al., 2017; Silva et al., 2016; Del Vigna et al., 2017; Basile et al., 2019), but the relative sparsity of hate speech requires sampling using keywords (e.g., Olteanu et al., 2018) or sampling from environments with unusually high rates of hate speech (e.g., de Gibert et al., 2018; Hoover et al., 2019). Modern text classifiers thus struggle to learn a model of hate speech that generalizes to real-world applications (Wiegand et al., 2019).
A specific problem found in neural hate speech classifiers is their over-sensitivity to group identifiers like "Muslim", "gay", and "black", which are only hate speech when combined with the right context (Dixon et al., 2018). In Figure 1 we see two documents containing the word "black" that a fine-tuned BERT model predicted to be hate speech, while only the second occurs in a hateful context. Neural text classifiers achieve state-of-the-art performance in hate speech detection, but are uninterpretable and can break when presented with unexpected inputs (Niven and Kao, 2019). It is thus difficult to contextualize a model's treatment of identifier words. Our approach to this problem is to use the Sampling and Occlusion (SOC) explanation algorithm, which estimates model-agnostic, post-hoc feature importance (Jin et al., 2020). We apply this approach to the Gab Hate Corpus (Kennedy et al., 2020), a new corpus labeled for "hate-based rhetoric", and an annotated corpus from the Stormfront white supremacist online forum (de Gibert et al., 2018).

Figure 1: "[F]or many Africans, the most threatening kind of ethnic hatred is black against black." (New York Times); "There is a great discrepancy between whites and blacks in SA. It is … [because] blacks will always be the most backward race in the world." (Anonymous user, Gab.com)

* Code is available here
Based on the explanations generated via SOC, which showed models were biased towards group identifiers, we then propose a novel regularization-based approach in order to increase model sensitivity to the context surrounding group identifiers. We apply regularization during training to the explanation-based importance of group identifiers, coercing models to consider the context surrounding them.
We find that regularization reduces the attention given to group identifiers and heightens the importance of the more generalizable features of hate speech, such as dehumanizing and insulting language. In experiments on an out-of-domain test set of news articles containing group identifiers, which are heuristically assumed to be non-hate speech, we find that regularization greatly reduces the false positive rate, while in-domain, out-of-sample classification performance is either maintained or improved.

Related Work
Our work is conceptually influenced by Warner and Hirschberg (2012), who formulated hate speech detection as disambiguating the use of offensive words from abusive versus non-abusive contexts. More recent approaches, applied to a wide typology of hate speech (Waseem et al., 2017), build supervised models trained on annotated (e.g., Waseem and Hovy, 2016; de Gibert et al., 2018) or heuristically-labeled (Wulczyn et al., 2017; Olteanu et al., 2018) data. These models suffer from the highly skewed distributions of language in these datasets (Wiegand et al., 2019).
Research on bias in classification models also influences this work. Dixon et al. (2018) measured and mitigated bias in toxicity classifiers towards social groups, avoiding undesirable predictions of toxicity towards innocuous sentences containing tokens like "gay". Similarly, annotators' biases towards certain social groups were found to be magnified during classifier training (Mostafazadeh Davani et al., 2020). Specifically within the domain of hate speech and abusive language, Park et al. (2018) and Sap et al. (2019) have defined and studied gender and racial bias, emphasizing issues of undetected dialect variation and imbalanced training data, respectively. Techniques for bias reduction in these settings include data augmentation by training on less biased data, term swapping during training (i.e., swapping gender words), and using debiased word embeddings (Bolukbasi et al., 2016).
Complementing these works, we directly manipulate how models represent the context surrounding identifier terms by regularizing explanations of these terms. Specifically, we use post-hoc explanation algorithms to interpret and modulate fine-tuned language models like BERT (Devlin et al., 2018), which achieve state-of-the-art performance on many hate speech detection tasks (MacAvaney et al., 2019; Mandl et al., 2019). We focus on post-hoc explanation approaches, which interpret model predictions without elucidating the mechanisms by which the model works (Guidotti et al., 2019). These explanations reveal either word-level (Ribeiro et al., 2016; Sundararajan et al., 2017) or phrase-level importance (Murdoch et al., 2018; Singh et al., 2019) of inputs to predictions.

Data
We selected two public corpora for our experiments which highlight the rhetorical aspects of hate speech, versus merely the usage of slurs and explicitly offensive language. The "Gab Hate Corpus" (GHC; Kennedy et al., 2020) is a large, random sample (N = 27,655) from the Pushshift.io data dump of the Gab network †, which we have annotated according to a typology of "hate-based rhetoric", a construct motivated by hate speech criminal codes outside the U.S. and social science research on prejudice and dehumanization. Gab is a social network with a high rate of hate speech (Zannettou et al., 2018; Lima et al., 2018) and populated by the "Alt-right" (Anthony, 2016; Benson, 2016). Similarly with respect to domain and definitions, de Gibert et al. (2018) sampled posts from the "Stormfront" web domain (Meddaugh and Kay, 2009) and annotated them at the sentence level according to an annotation guide similar to that used for the GHC.
Train and test splits were randomly generated for Stormfront sentences (80/20) with "hate" taken as a positive binary label, and a test set was compiled from the GHC by drawing a random stratified sample with respect to the "target population" tag (possible values including race/ethnicity target, gender, religious, etc.). A single "hate" label was created by taking the union of two main labels, "human degradation" and "calls for violence". Training data for the GHC (GHC train) included 24,353 posts with 2,027 labeled as hate, and test data for the GHC (GHC test) included 1,586 posts with 372 labeled as hate. Stormfront splits resulted in 7,896 (1,059 hate) training sentences, 979 (122) validation, and 1,998 (246) test.
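As a minimal sketch of the label construction described above, the snippet below derives a binary "hate" label as the union of the two main labels and computes the share of positive examples from the split sizes reported in the text. The field names `human_degradation` and `calls_for_violence` and the example records are hypothetical; the released corpora may use a different schema.

```python
from typing import Dict, List


def hate_label(annotation: Dict[str, int]) -> int:
    """Union of the two main labels: 1 if either is set, else 0."""
    # Field names are hypothetical placeholders, not the corpus schema.
    return int(
        bool(annotation["human_degradation"])
        or bool(annotation["calls_for_violence"])
    )


def positive_rate(n_hate: int, n_total: int) -> float:
    """Fraction of examples carrying the positive ("hate") label."""
    return n_hate / n_total


# Hypothetical annotated posts, for illustration only.
posts: List[Dict[str, int]] = [
    {"human_degradation": 1, "calls_for_violence": 0},
    {"human_degradation": 0, "calls_for_violence": 1},
    {"human_degradation": 0, "calls_for_violence": 0},
    {"human_degradation": 1, "calls_for_violence": 1},
]
labels = [hate_label(p) for p in posts]

# Split sizes as reported in the text: (hate, total).
splits = {
    "GHC train": (2027, 24353),
    "GHC test": (372, 1586),
    "Stormfront train": (1059, 7896),
}
for name, (n_hate, n_total) in splits.items():
    print(f"{name}: {positive_rate(n_hate, n_total):.1%} hate")
```

Roughly 8% of GHC train carries the positive label, which is the imbalance that the loss reweighting in Section A.3 is meant to address.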

Analyzing Group Identifier Bias
To establish and define our problem more quantitatively, we analyze hate speech models' bias towards group identifiers and how this leads to false positive errors during prediction. We analyze the top features of a linear model and use post-hoc explanations applied to a fine-tuned BERT model in order to measure models' bias towards these terms. We then establish the effect of these tendencies on model predictions using an adversarial-like dataset of New York Times articles.

† https://files.pushshift.io/gab/

Classification Models
We apply our analyses to two text classifiers: logistic regression with bag-of-words features and a fine-tuned BERT model (Devlin et al., 2018). The BERT model prepends a special CLS token to the input sentence and feeds the sentence into stacked layers of Transformer (Vaswani et al., 2017) encoders. The representation of the CLS token at the final layer is fed into a linear layer to perform 2-way classification (hate or non-hate). Model configuration and training details can be found in Section A.3.

Model Interpretation
We first determine a model's sensitivity towards group identifiers by examining the models themselves. Linear classifiers can be examined in terms of their most highly-weighted features. For the fine-tuned BERT model discussed above, we apply a post-hoc explanation algorithm to extract similar information.
Group identifiers in linear models. From the top features in a bag-of-words logistic regression of hate speech on GHC train, we collected a set of twenty-five identity words (not restricted to social group terms, but terms identifying a group in general), including "homosexual", "muslim", and "black", which are used in our later analyses. The full list is in Supplementals (A.1).
Explanation-based measures. State-of-the-art fine-tuned BERT models are able to model complicated word and phrase compositions: for example, some words are only offensive when they are composed with specific ethnic groups. To capture this, we apply the state-of-the-art Sampling and Occlusion (SOC) algorithm, which is capable of generating hierarchical explanations for a prediction.
To generate hierarchical explanations, SOC starts by assigning importance scores to phrases in a way that eliminates compositional effects between a phrase and the context x_δ surrounding it within a window. Given a phrase p appearing in a sentence x, SOC assigns an importance score φ(p) that indicates how much p contributes to the sentence being classified as hate speech. The algorithm first computes s(x), the difference between the unnormalized prediction scores for "hate" and "non-hate" in the 2-way classifier. It then evaluates the average change in s(x) when the phrase is masked with padding tokens (denoted x\p) over different inputs, in which the N-word contexts around the phrase p are sampled from a pretrained language model while all other words remain the same as in the given x. Formally, the importance score φ(p) is measured as

$$\phi(p) = \mathbb{E}_{x_\delta}\left[ s(x) - s(x \backslash p) \right].$$

In addition, the SOC algorithm performs agglomerative clustering over explanations to generate a hierarchical layout.
Averaged Word-level SOC Explanation. Using SOC explanations output on GHC test, we compute average word importance and present the top 20 in Table 2.
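The sampling-and-occlusion score described above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the authors' implementation: `toy_score` is a hypothetical additive stand-in for the classifier score s(x), and `toy_sampler` draws uniformly from a tiny vocabulary in place of a pretrained language model. Only the structure (resample the context within a window, mask the phrase with padding, average the score difference) follows the text.

```python
import random
from typing import Callable, Dict, List, Tuple

PAD = "[PAD]"

# Hypothetical additive lexicon standing in for s(x). A real s(x) would be
# the "hate" minus "non-hate" logit of a fine-tuned classifier.
TOY_WEIGHTS: Dict[str, float] = {"group": 1.5, "awful": 2.0}
TOY_VOCAB: List[str] = ["the", "is", "awful", "fine", "very"]


def toy_score(tokens: List[str]) -> float:
    return sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)


def toy_sampler(tokens: List[str], position: int, rng: random.Random) -> str:
    """Stand-in for sampling a context word from a pretrained language model."""
    return rng.choice(TOY_VOCAB)


def soc_importance(
    tokens: List[str],
    span: Tuple[int, int],
    score_fn: Callable[[List[str]], float],
    sampler: Callable[[List[str], int, random.Random], str],
    window: int,
    n_samples: int,
    rng: random.Random,
) -> float:
    """Average of s(x) - s(x\\p) over resampled contexts around tokens[start:end]."""
    start, end = span
    lo, hi = max(0, start - window), min(len(tokens), end + window)
    total = 0.0
    for _ in range(n_samples):
        sampled = list(tokens)
        # Resample context words inside the window, leaving the phrase intact.
        for i in range(lo, hi):
            if not start <= i < end:
                sampled[i] = sampler(sampled, i, rng)
        # Mask the phrase with padding tokens in the same sampled context.
        masked = list(sampled)
        for i in range(start, end):
            masked[i] = PAD
        total += score_fn(sampled) - score_fn(masked)
    return total / n_samples


sentence = ["the", "group", "is", "awful"]
phi = soc_importance(
    sentence, (1, 2), toy_score, toy_sampler,
    window=20, n_samples=20, rng=random.Random(0),
)
print(phi)
```

Because the toy scorer is additive, the sampled context cancels out and the score of "group" equals its own weight. With a real neural classifier the difference varies with the sampled context, which is why the average over samples is taken.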

Bias in Prediction
Hate speech models can be over-attentive to group identifiers, as we have seen by inspecting them through feature analysis and a post-hoc explanation approach. The effect of this during prediction is that models over-associate these terms with hate speech and neglect the context around the identifier, resulting in false positives. To provide an external measure of models' over-sensitivity to group identifiers, we construct an adversarial test set of New York Times (NYT) articles that are filtered to contain a balanced, random sample of the twenty-five group identifiers (Section A.1). This gives us 12,500 documents which are devoid of hate speech as defined by our typologies, excepting quotation. It is key for models not to ignore identifiers, but to match them with the right context. We examine the effect of ignoring identifiers by removing random subsets of words ranging in size from 0 to 25, with each subset sample size repeated 5 times. Decreased rates of false positives on the NYT set are accompanied by poor performance in hate speech detection.
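Since every document in the NYT set is treated as non-hate, accuracy on it is simply one minus the false positive rate. The short sketch below makes that relationship explicit; the prediction list is hypothetical, and the per-identifier count assumes the balanced sample is split evenly across the 25 terms.

```python
from typing import Sequence


def accuracy_on_negative_set(predictions: Sequence[int]) -> float:
    """Accuracy when every gold label is 0 (non-hate): 1 - false positive rate."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    false_positives = sum(1 for p in predictions if p == 1)
    return 1.0 - false_positives / len(predictions)


# Assumption: an evenly balanced sample of 12,500 documents over 25 identifiers.
docs_per_identifier = 12_500 // 25

# Hypothetical model outputs on eight NYT documents (1 = predicted hate).
example_predictions = [0, 0, 1, 0, 0, 0, 1, 0]
nyt_accuracy = accuracy_on_negative_set(example_predictions)
print(docs_per_identifier, nyt_accuracy)
```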

Contextualizing Hate Speech Models
We have shown hate speech models to be over-sensitive to group identifiers and unable to learn from the context surrounding these words during training. To address this problem in state-of-the-art models, we propose that models can be regularized to give no explained importance to identifier terms. We explain our approach as well as a naive baseline based on removing these terms.
Word Removal Baseline. The simplest approach is to remove group identifiers altogether. We remove words from the term list found in Section A.1 from both training and testing sentences.
Explanation Regularization. Given that SOC explanations are fully differentiable, during training we regularize SOC explanations on the group identifiers to be close to 0, in addition to optimizing the classification objective L′. The combined learning objective is written as follows:

$$\min \mathcal{L} = \mathcal{L}' + \alpha \sum_{w \in x \cap S} \left[ \phi(w) \right]^2$$

where S denotes the set of group names and x denotes the input word sequence. α is a hyperparameter controlling the strength of the regularization. In addition to SOC, we also experiment with regularizing input occlusion (OC) explanations, defined as the prediction change when a word or phrase is masked out, which bypasses the sampling step in SOC.
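To make the combined objective concrete, the sketch below computes a classification loss plus an α-weighted penalty on the explanation of each group identifier in the input, using the simpler occlusion (OC) explanation. It is a forward-pass illustration only, under several assumptions: the squared penalty is one way of pushing an explanation toward 0, `toy_score` is a hypothetical additive stand-in for the model score, and the token "group" stands in for an identifier term. In actual training the penalty would be differentiated through the model.

```python
from typing import Callable, Dict, List, Set

PAD = "[PAD]"

# Hypothetical additive stand-in for the classifier score s(x).
TOY_WEIGHTS: Dict[str, float] = {"group": 1.5, "awful": 2.0}


def toy_score(tokens: List[str]) -> float:
    return sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)


def oc_importance(
    tokens: List[str], index: int, score_fn: Callable[[List[str]], float]
) -> float:
    """Input occlusion: score change when a single word is masked out."""
    masked = list(tokens)
    masked[index] = PAD
    return score_fn(tokens) - score_fn(masked)


def regularized_loss(
    tokens: List[str],
    classification_loss: float,
    score_fn: Callable[[List[str]], float],
    group_terms: Set[str],
    alpha: float,
) -> float:
    """Classification loss plus alpha times squared explanations of identifiers."""
    penalty = sum(
        oc_importance(tokens, i, score_fn) ** 2
        for i, tok in enumerate(tokens)
        if tok.lower() in group_terms
    )
    return classification_loss + alpha * penalty


sentence = ["the", "group", "is", "awful"]
loss = regularized_loss(
    sentence, classification_loss=0.4, score_fn=toy_score,
    group_terms={"group"}, alpha=0.1,
)
print(loss)
```

Only the identifier contributes to the penalty, so the importance assigned to the surrounding context ("awful" in this toy sentence) is left untouched. A larger α trades classification loss for a smaller identifier explanation.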

Experiment Details
Balancing performance on hate speech detection and the NYT test set is our quantitative measure of how well a model has learned the contexts in which group identifiers are used for hate speech. We apply our regularization approach to this task, and compare with a word removal strategy for the fine-tuned BERT model. We repeat the process for both the GHC and Stormfront, evaluating test set hate speech classification in-domain and accuracy on the NYT test set. For the GHC, we used the full list of 25 terms; for Stormfront, we used the 10 terms which were also found in the top predictive features in linear classifiers for the Stormfront data. Congruently, for Stormfront we filtered the NYT corpus to only contain these 10 terms (N = 5,000).

Results
Performance is reported in Table 1. For the GHC, we see an improvement for in-domain hate speech classification, as well as an improvement in false positive reduction on the NYT corpus. For Stormfront, we see the same improvements for in-domain F1 and NYT. For the GHC, the most marked difference between BERT+WR and BERT+SOC is increased recall, suggesting that baseline removal largely mitigates bias towards identifiers at the cost of more false negatives.
As discussed in Section 4.2, SOC eliminates the compositional effects of a given word or phrase. As a result, regularizing SOC explanations does not prohibit the model from utilizing contextual information related to group identifiers. This can possibly explain the improved performance in hate speech detection relative to word removal.
Word Importance in Regularized Models. We determined that regularization improves a model's focus on non-identifier context in prediction. In Table 2 we show the changes in word importance as measured by SOC. Identity terms' importance decreases, and we also see a significant increase in importance of terms related to hate speech ("poisoned", "blamed", etc.), suggesting that models have learned from the identifier terms' context.
Visualizing Effects of Regularization. We can further see the effect of regularization by considering Figure 3, where hierarchically clustered explanations from SOC are visualized before and after regularization, correcting a false positive.

Table 1: Precision, recall, F1 (%) on GHC test and Stormfront (Stf.) test set and accuracy (%) on NYT evaluation set. We report mean and standard deviation of the performance across 10 runs for BERT, BERT + WR (word removal), BERT + OC, and BERT + SOC.

Training set:          GHC                                                  Stormfront
Method                 Precision    Recall       F1           NYT Acc.     Precision    Recall       F1           NYT Acc.
(label not recovered)  60.56 ± 1.8  69.72 ± 3.6  64.14 ± 3.2  89.43 ± 4.3  57.47 ± 3.7  51.10 ± 4.4  53.82 ± 1.3  95.39 ± 2.3
BERT + SOC (α=0.1)     70.17 ± 2.5  69.03 ± 3.0  69.52 ± 1.3  83.16 ± 5.0  57.29 ± 3.4  54.27 ± 3.3  55.55 ± 1.1  93.93 ± 3.6
BERT + SOC (α=1.0)     64.29 ± 3.1  69.41 ± 3.8  66.67 ± 2.5  90.06 ± 2.6  56.05 ± 3.9  54.35 ± 3.4  54.97 ± 1.1  95.40 ± 2.0
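The "mean ± standard deviation" entries of Table 1 can be produced with a few lines of standard-library code. The sketch below is illustrative: the per-run precision and recall values are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the text does not say which was used.

```python
import statistics
from typing import List, Sequence, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_and_std(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across runs (assumed estimator)."""
    return statistics.mean(values), statistics.stdev(values)


def format_cell(values: Sequence[float]) -> str:
    mean, std = mean_and_std(values)
    return f"{mean:.2f} ± {std:.1f}"


# Hypothetical (precision, recall) pairs from three training runs, in percent.
runs: List[Tuple[float, float]] = [(70.0, 68.0), (72.0, 66.0), (68.0, 70.0)]
f1_per_run = [f1_score(p, r) for p, r in runs]
print("Precision:", format_cell([p for p, _ in runs]))
print("Recall:   ", format_cell([r for _, r in runs]))
print("F1:       ", format_cell(f1_per_run))
```

When each metric is averaged over runs separately, the reported mean F1 need not equal the F1 computed from the reported mean precision and recall.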

Conclusion & Future Work
Regularizing SOC explanations of group identifiers tunes hate speech classifiers to be more context-sensitive and less reliant on high-frequency words in imbalanced training sets. Complementing prior work in bias detection and removal in the context of hate speech and in other settings, our method is directly integrated into Transformer-based models and does not rely on data augmentation. As such, it is an encouraging technique towards directing models' internal representation of target phenomena via lexical anchors. Future work includes direct extension and validation of this technique with other language models such as GPT-2 (Radford et al., 2019); experimenting with other hate speech or offensive language datasets; and experimenting with these and other sets of identity terms. Also motivated by the present work is the more general pursuit of integrating structure into neural models like BERT.
Regularizing hate speech classifiers increases their sensitivity to the compositionality of hate speech, but the phenomena remain highly complex rhetorically and difficult to learn through supervision. For example, this post from the GHC requires background information and reasoning across sentences in order to classify as offensive or prejudiced: "Donald Trump received much criticism for referring to Haiti, El Salvador and Africa as 'shitholes'. He was simply speaking the truth." The examples we presented (see Appendix 4 and 5) show that regularization leads to models that are context-sensitive to a degree, but not to the extent of reasoning over sentences like those above. We hope that the present work can motivate more attempts to inject more structure into hate speech classification.
Explanation algorithms offer a window into complex predictive models, and regularization as performed in this work can improve models' internal representations of target phenomena. In this work, we effectively applied this technique to hate speech classifiers biased towards group identifiers; future work can determine the effectiveness and further potential for this technique in other tasks and contexts.

A Appendices
A.1 Full List of Curated Group Identifiers
muslim, jew, jews, white, islam, blacks, muslims, women, whites, gay, black, democat, islamic, allah, jewish, lesbian, transgender, race, brown, woman, mexican, religion, homosexual, homosexuality, africans

A.3 Implementation Details
Training Details. We fine-tune the BERT-base model using the public code ‡, where the batch size is set to 32 and the learning rate of the Adam (Kingma and Ba, 2015) optimizer is set to 2 × 10^-5. Validation is performed every 200 iterations, and the learning rate is halved when the validation F1 decreases. Training stops once the learning rate has been halved 5 times. To handle the data imbalance issue, we reweight the training loss so that positive examples are weighted 10.
Explanation Algorithm Details. For the SOC algorithm, we set the number of samples and the size of the context window to 20 and 20, respectively, for explanation analysis, and to 5 and 5, respectively, for explanation regularization.

‡ https://github.com/huggingface/transformers
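The training schedule described above can be sketched as follows. This is a toy simulation under stated assumptions rather than the actual training loop: a "decrease" in validation F1 is taken to mean lower than the previous validation step (it could instead be measured against the best score so far), the validation history is hypothetical, and the positive-class weight of 10 is read as a factor relative to negative examples.

```python
import math
from typing import List, Sequence, Tuple


def run_schedule(
    val_f1_history: Sequence[float],
    initial_lr: float = 2e-5,
    max_halvings: int = 5,
) -> Tuple[List[float], int]:
    """Halve the learning rate whenever validation F1 drops; stop after max_halvings."""
    lr = initial_lr
    halvings = 0
    previous = None
    lrs: List[float] = []
    for f1 in val_f1_history:
        # Assumption: compare against the previous validation step.
        if previous is not None and f1 < previous:
            lr /= 2
            halvings += 1
        lrs.append(lr)
        if halvings >= max_halvings:
            break  # training stops once the rate has been halved 5 times
        previous = f1
    return lrs, halvings


def weighted_nll(prob_positive: float, label: int, positive_weight: float = 10.0) -> float:
    """Class-weighted negative log-likelihood for a binary prediction."""
    prob_of_label = prob_positive if label == 1 else 1.0 - prob_positive
    weight = positive_weight if label == 1 else 1.0
    return -weight * math.log(prob_of_label)


# Hypothetical validation F1 values, one per validation step (every 200 iterations).
lrs, halvings = run_schedule([0.50, 0.40, 0.45, 0.30])
print(lrs, halvings)
print(weighted_nll(0.5, 1), weighted_nll(0.5, 0))
```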

A.4 Cross-Domain Performance
In addition to evaluating each model within-domain (i.e., training on GHC train and evaluating on GHC test), we evaluated each model across domains. The results of these experiments, conducted in the same way as before, are presented in Table 5.

Table 5: Cross-domain F1 on Gab, Stormfront (Stf.) datasets. We report mean and standard deviation of the performance within 10 runs for BERT, BERT + WR (word removal), BERT + OC, and BERT + SOC.