A little goes a long way: Improving toxic language classification despite data scarcity

Detection of some types of toxic language is hampered by extreme scarcity of labeled training data. Data augmentation – generating new synthetic data from a labeled seed dataset – can help. The efficacy of data augmentation on toxic language classification has not been fully explored. We present the first systematic study on how data augmentation techniques impact performance across toxic language classifiers, ranging from shallow logistic regression architectures to BERT – a state-of-the-art pretrained Transformer network. We compare the performance of eight techniques on very scarce seed datasets. We show that while BERT performed the best, shallow classifiers performed comparably when trained on data augmented with a combination of three techniques, including GPT-2-generated sentences. We discuss the interplay of performance and computational overhead, which can inform the choice of techniques under different constraints.


Introduction
We systematically compared eight augmentation techniques on four classifiers, ranging from shallow architectures to BERT (Devlin et al., 2019), a popular pre-trained Transformer network. We used downsampled variants of the Kaggle Toxic Comment Classification Challenge dataset (Jigsaw, 2018; §3) as our seed dataset. We focused on the threat class, but also replicated our results on another toxic class (§4.6). With some classifiers, we reached the same F1-score as when training on the original dataset, which is 20x larger. However, performance varied markedly between classifiers.
We obtained the highest overall results with BERT, increasing the F1-score by up to 21 percentage points compared to training on seed data alone. However, augmentation using a fine-tuned GPT-2 (§3.2.4), a pre-trained Transformer language model (Radford et al., 2019), reached almost BERT-level performance even with shallow classifiers. Combining multiple augmentation techniques, such as adding majority class sentences to minority class documents (§3.2.3) and replacing subwords with embedding-space neighbors (Heinzerling and Strube, 2018; §3.2.2), improved performance on all classifiers. We discuss the interplay of performance and computational requirements, such as memory and run-time costs (§4.5). We release our source code. 1

Preliminaries
Data augmentation arises naturally from the problem of filling in missing values (Tanner and Wong, 1987). In classification, data augmentation is applied to available training data. Classifier performance is measured on a separate (non-augmented) test set.

Dataset
We used Kaggle's toxic comment classification challenge dataset (Jigsaw, 2018). It contains human-labeled English Wikipedia comments in six different classes of toxic language. 2 The median length of a document is three sentences, but the distribution is heavy-tailed (Table 1).
Table 1: Document length in sentences.

Mean  Std.  Min  Max  25%  50%  75%
4     6     1    683  2    3    5

Our experiments concern binary classification, where one class is the minority class and all remaining documents belong to the majority class. We focus on threat as the minority class, as it poses the greatest challenge for automated analysis in this dataset (van Aken et al., 2018). To confirm our results, we also applied the best-performing techniques to a different type of toxic language, the identity-hate class (§4.6).
Our goal is to understand how data augmentation improves performance under extreme data scarcity in the minority class (threat). To simulate this, we derive our seed dataset (SEED) from the full dataset (GOLD STANDARD) via stratified bootstrap sampling (Bickel and Freedman, 1984), reducing the dataset size k-fold. We replaced newlines, tabs and repeated spaces with single spaces, and lowercased each dataset. We applied data augmentation techniques to SEED with k-fold oversampling of the minority class, and compared each classifier architecture (§3.3) trained on SEED, GOLD STANDARD, and the augmented datasets. We used the original test dataset (TEST) for evaluating performance. We detail the dataset sizes in Table 2.

Ethical considerations. We used only public datasets, and did not involve human subjects.
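The seed-derivation step described above can be sketched as follows. This is a minimal illustration, not the paper's released code: the function names are ours, and per-class sampling with replacement stands in for the stratified bootstrap.

```python
import random
import re

def preprocess(text: str) -> str:
    """Collapse newlines, tabs and repeated spaces; lowercase (as in §3)."""
    return re.sub(r"\s+", " ", text).strip().lower()

def derive_seed(docs, labels, k: int, seed: int = 0):
    """Stratified bootstrap downsampling: draw a 1/k-sized sample
    (with replacement) from each class, then preprocess the documents."""
    rng = random.Random(seed)
    chosen = []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        chosen.extend(rng.choices(idx, k=max(1, len(idx) // k)))
    return [preprocess(docs[i]) for i in chosen], [labels[i] for i in chosen]
```

With k = 20 this reduces both classes twenty-fold, mimicking the SEED construction from GOLD STANDARD.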

Data augmentation techniques
We evaluated six data augmentation techniques on four classifiers (Table 3). We describe each augmentation technique (below) and classifier (§3.3). For comparison, we also evaluated simple oversampling (COPY) and EDA (Wei and Zou, 2019), both reviewed in §2. Following the recommendation of Wei and Zou (2019) for applying EDA to small seed datasets, we used a 5% augmentation probability, whereby each word has a 1 − 0.95^4 ≈ 19% probability of being transformed by at least one of the four EDA techniques. Four of the six techniques are based on replacing words with semantically close counterparts: two using semantic knowledge bases (§3.2.1) and two using pre-trained embeddings (§3.2.2). We applied 25% of all possible replacements with these techniques, which is close to the recommended substitution rate in EDA. For short documents we ensured that at least one substitution was always selected. We also added majority class material to minority class documents (§3.2.3), and generated text with the GPT-2 language model fine-tuned on SEED (§3.2.4).
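The ≈19% figure follows from treating the four EDA operations as independent per-word events, each firing with probability 0.05:

```python
p = 0.05                 # per-technique augmentation probability
n_techniques = 4         # EDA applies four independent operations per word

# probability that at least one of the four operations transforms a word
p_any = 1 - (1 - p) ** n_techniques
print(round(p_any, 3))   # 0.185, i.e. roughly 19%
```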

Substitutions from a knowledge base
WordNet is a semantic knowledge base containing various properties of word senses, which correspond to word meanings (Miller, 1995). We augmented SEED by replacing words with random synonyms. While EDA also uses WordNet synonyms ( §2), we additionally applied word sense disambiguation (Navigli, 2009) and inflection.
For word sense disambiguation we used simple Lesk from PyWSD (Tan, 2014). As a variant of the Lesk algorithm (Lesk, 1986), it relies on the overlap between the definitions and example sentences (both provided in WordNet) of each candidate sense and the words in the context. Word senses appear as uninflected lemmas, which we inflected using a dictionary-based technique: we lemmatized and POS-tagged a large corpus with NLTK (Bird et al., 2009), and mapped each <lemma, tag> combination to its most common surface form. The corpus contains 8.5 million short sentences (≤ 20 words) from multiple open-source corpora (see Appendix E). We designed it to have both a large vocabulary for wide coverage (371,125 lemmas) and grammatically simple sentences to maximize correct tagging.

Paraphrase Database (PPDB) was collected from bilingual parallel corpora on the premise that English phrases translated identically into another language tend to be paraphrases (Ganitkevitch et al., 2013; Pavlick et al., 2015). We used phrase pairs tagged as equivalent, constituting 245,691 paraphrases altogether. We controlled substitution by grammatical context as specified in PPDB: for single words this is the part-of-speech tag, whereas for multi-word paraphrases it also contains the syntactic category that appears after the original phrase in the PPDB training corpus. We obtained grammatical information with the Spacy 3 parser.
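The dictionary-based inflection step can be sketched as follows. The helper names are ours, and the lemmatizer is passed in as a stand-in for NLTK's tooling; this illustrates the mapping, not the paper's exact implementation.

```python
from collections import Counter, defaultdict

def build_inflection_dict(tagged_corpus, lemmatize):
    """Map each (lemma, POS tag) pair to its most frequent surface form.

    tagged_corpus: iterable of (word, tag) pairs from a POS-tagged corpus.
    lemmatize: callable (word, tag) -> lemma, e.g. wrapping NLTK's
    WordNetLemmatizer (hypothetical here).
    """
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[(lemmatize(word, tag), tag)][word] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

def inflect(lemma, tag, inflection_dict):
    """Inflect an uninflected lemma; fall back to the lemma itself."""
    return inflection_dict.get((lemma, tag), lemma)
```

A WordNet synonym, which appears as an uninflected lemma, can then be re-inflected with the original word's POS tag before substitution.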

Embedding neighbour substitutions
Embeddings can be used to map units to others with a similar occurrence distribution in a training corpus (Mikolov et al., 2013). We considered two alternative pre-trained embedding models. For each model, we produced top-10 nearest embedding neighbours (cosine similarity) of each word selected for replacement, and randomly picked the new word from these.
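A minimal sketch of the neighbour lookup, assuming the vocabulary and embedding matrix have already been loaded from a pre-trained model such as GloVe or BPEmb (loading code omitted; names are ours):

```python
import random

import numpy as np

def top_k_neighbours(word, vocab, vectors, k=10):
    """Return the k nearest words to `word` by cosine similarity.

    vocab: list of words; vectors: array of shape (len(vocab), dim).
    """
    i = vocab.index(word)
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed[i]               # cosine similarity to `word`
    order = np.argsort(-sims)               # most similar first
    return [vocab[j] for j in order if j != i][:k]

def replace_word(word, vocab, vectors, rng=random):
    """Pick a random word from the top-10 neighbours, as in §3.2.2."""
    return rng.choice(top_k_neighbours(word, vocab, vectors, k=10))
```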

Majority class sentence addition (ADD)
Adding unrelated material to the training data can be beneficial by making relevant features stand out (Wong et al., 2016; Shorten and Khoshgoftaar, 2019). We added a random sentence from a majority class document in SEED to a random position in a copy of each minority class training document.
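ADD can be sketched in a few lines; the naive sentence splitting below is for illustration only and is not the paper's exact procedure.

```python
import random

def add_majority_sentence(minority_doc, majority_sentences, rng=random):
    """ADD: insert one random majority-class sentence at a random
    position in a copy of a minority-class document."""
    sentences = minority_doc.split(". ")    # crude sentence split, illustrative
    pos = rng.randrange(len(sentences) + 1)
    sentences.insert(pos, rng.choice(majority_sentences))
    return ". ".join(sentences)
```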

GPT-2 conditional generation
GPT-2 is a Transformer language model pre-trained on a large collection of Web documents. We used the 110M-parameter GPT-2 model from the Transformers library (Wolf et al., 2019); we discuss generation parameters in Appendix F. We augmented the minority class by conditional generation with N-fold oversampling.
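The oversampling scheme can be sketched as follows, with the fine-tuned model behind a placeholder `generate_fn`; the actual fine-tuning and generation parameters are discussed in Appendix F, so this shows only the bookkeeping.

```python
def augment_gpt2(minority_docs, n_fold, generate_fn):
    """N-fold oversampling: keep each original SEED document and add
    (n_fold - 1) generated documents per original.

    generate_fn(prompt) stands in for sampling from a GPT-2 model
    fine-tuned on SEED (e.g. via the Transformers library); it is a
    placeholder, not the paper's exact generation call.
    """
    augmented = []
    for doc in minority_docs:
        augmented.append(doc)                  # original copy retained
        for _ in range(n_fold - 1):
            augmented.append(generate_fn(doc)) # conditional generation
    return augmented
```

With n_fold = 20 this matches the 20x augmentation used in §4.2 (one original plus 19 synthetic documents per SEED document).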

Classifiers
Char-LR and Word-LR. We adapted the logistic regression pipeline from the Wiki-detox project (Wulczyn et al., 2017). 5 We allowed n-grams in the range 1-4, and kept the default parameters: TF-IDF normalization, vocabulary size of 10,000 and parameter C = 10 (inverse regularization strength).

CNN. We applied a word-based CNN model with 10 kernels each of sizes 3, 4 and 5. Vocabulary size was 10,000 and embedding dimensionality 300. For training, we used a dropout probability of 0.1, and the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001.

BERT. We used the pre-trained Uncased BERT-Base model and trained it with the training script from Fast-Bert. 6 We set the maximum sequence length to 128 and the mixed precision optimization level to O1.
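A rough scikit-learn reconstruction of the Char-LR pipeline with the hyperparameters stated above (this approximates, rather than reproduces, the Wiki-detox pipeline; `max_iter` is raised here only for convergence):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Char-LR: character n-grams in the range 1-4, TF-IDF weighting,
# vocabulary capped at 10,000, inverse regularization strength C = 10.
char_lr = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 4), max_features=10_000),
    LogisticRegression(C=10, max_iter=1000),
)
```

Word-LR would use the same pipeline with `analyzer="word"`.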
We report precision, recall and the macro-averaged F1-score for each classifier and augmentation technique. (For brevity, we use "F1-score" from now on.) The majority class F1-score remained 1.00 (two-digit rounding) across all our experiments. All classifiers are binary, and we assigned predictions to the class with the higher conditional probability. We relax this assumption in §4.4, where we report area under the curve (AUC) values (Murphy, 2012).
To validate our results, we performed repeated experiments with the common random numbers technique (Glasserman and Yao, 1992), by which we controlled the sampling of SEED, initial random weights of classifiers, and the optimization procedure. We repeated the experiments 30 times, and report confidence intervals.

Results without augmentation
We first show classifier performance on GOLD STANDARD and SEED in Table 4. van Aken et al. (2018) reported F1-scores for logistic regression and CNN classifiers on GOLD STANDARD. Our results are comparable. We also evaluate BERT, which is noticeably better on GOLD STANDARD, particularly in terms of threat recall.
All classifiers had significantly reduced F1-scores on SEED, due to major drops in threat recall. In particular, BERT was degenerate, assigning all documents to the majority class in all 30 repetitions. Devlin et al. (2019) report that such behavior may occur on small datasets, and that random restarts may help. In our case, random restarts did not improve BERT performance on SEED.

Augmentations
We applied all eight augmentation techniques (§3.2) to the minority class of SEED (threat). Each technique retains one copy of each SEED document, and adds 19 synthetically generated documents per SEED document. Table 5 summarizes augmented dataset sizes. We present our main results in Table 6. We first discuss classifier-specific observations, and then make general observations on each augmentation technique.

Table 5: Class sizes in SEED and after 20x augmentation (threat).

          SEED   Augmented
Minority  25     25 → 500
Majority  7955   7955

We compared the impact of augmentations on each classifier, and therefore our performance comparisons below are local to each column (i.e., classifier). We identify the best performing technique for the three metrics and report the p-value when its effect is significantly better than that of the other techniques (based on one-sided paired t-tests, α = 5%). 7

7 The statistical significance results apply to this dataset, but are indicative of the behavior of the techniques in general.

BERT. COPY and ADD were successful on BERT, raising the F1-score up to 21 percentage points above SEED, to 0.71. But their impacts on BERT differed: ADD led to increased recall, while COPY resulted in increased precision. PPDB precision and recall were statistically indistinguishable from COPY, which indicates that it made few alterations. GPT-2 led to significantly better recall (p < 10^-5 for all pairings), even surpassing GOLD STANDARD. Word substitution methods (EDA, WORDNET, GLOVE and BPEMB) improved on
SEED, but were less effective than COPY in both precision and recall. Park et al. (2019) found that BERT may perform poorly on out-of-domain samples, and BERT is reportedly unstable under adversarially chosen subword substitutions (Sun et al., 2020). We suggest that non-contextual word embedding schemes may be sub-optimal for BERT, since its pre-training is not conducted on similarly noisy documents. We verified that reducing the number of replaced words was indeed beneficial for BERT (Appendix G).

Char-LR. BPEMB and ADD were effective at increasing recall, and reached similar increases in F1-score. GPT-2 raised recall to GOLD STANDARD level (p < 10^-5 for all pairings), but precision remained 16 percentage points below GOLD STANDARD. It led to the best increase in F1-score: 16 percentage points above SEED (p < 10^-3 for all pairings).

Word-LR. The embedding-based BPEMB and GLOVE increased recall by at least 13 percentage points, but the conceptually similar PPDB and WORDNET were largely unsuccessful. We suggest this discrepancy may be due to WORDNET and PPDB relying on standard written English, whereas toxic language tends to be more colloquial. GPT-2 increased recall and F1-score the most: 15 percentage points above SEED (p < 10^-10 for all pairings).

CNN. GLOVE and ADD increased recall by at least 10 percentage points. BPEMB led to a large increase in recall, but with a drop in precision, possibly due to its larger capacity to make changes in text: GLOVE can only replace entire words that exist in the pre-training corpus. GPT-2 yielded the largest increases in recall and F1-score (p < 10^-4 for all pairings).
We now discuss each augmentation technique. COPY emphasizes the features of original minority documents in SEED, which generally resulted in fairly high precision. On Word-LR, COPY is analogous to increasing the weight of words that appear in minority documents. EDA behaved similarly to COPY on Char-LR, Word-LR and CNN; but markedly worse on BERT. ADD reduces the classifier's sensitivity to irrelevant material by adding majority class sentences to minority class documents. On Word-LR, ADD is analogous to reducing the weights of majority class words. ADD led to a marginally better F1-score than any other technique on BERT.
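The one-sided paired t-tests used for these comparisons can be reproduced with SciPy by halving the two-sided p-value when the effect points in the hypothesized direction. This is a sketch; recent SciPy versions also accept `alternative='greater'` directly.

```python
from scipy import stats

def paired_one_sided(a, b):
    """One-sided paired t-test: is technique `a` significantly better
    than `b` across repeated experiments (alpha = 5%)?"""
    t, p = stats.ttest_rel(a, b)
    # Halve the two-sided p-value; the result is significant only if
    # the mean difference points in the hypothesized direction (t > 0).
    return (p / 2) if t > 0 else 1 - p / 2
```

Here `a` and `b` would be per-repetition F1-scores (or precision/recall) of two augmentation techniques under common random numbers.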

Table 6: Effects of augmentation techniques (20x) on SEED/threat, per classifier (columns: Char-LR, Word-LR, CNN, BERT). Precision and recall for threat; F1-score macro-averaged from both classes.
Word replacement was more effective with GLOVE and BPEMB than with PPDB or WORDNET. PPDB and WORDNET generally replace few words per document, which often resulted in performance similar to COPY. BPEMB was generally the most effective among these techniques. GPT-2 had the best improvement overall, leading to significant increases in recall across all classifiers, and the highest F1-score on all classifiers but BERT. The increase in recall can be attributed to GPT-2's capacity for introducing novel phrases. We corroborated this hypothesis by measuring the overlap between the original and augmented training sets and an offensive/profane word list from von Ahn. 8 GPT-2 augmentations increased the intersection cardinality by 260% relative to the original, compared to only 84% and 70% for the next-best performing augmentation techniques (ADD and BPEMB, respectively). This demonstrates that GPT-2 significantly increased the vocabulary range of the training set, specifically with offensive words likely to be relevant for toxic language classification. However, there is a risk that human annotators might not label GPT-2-generated documents as toxic; such label noise may decrease precision. (See Appendix H, Table 22 for example augmentations that display the behavior of GPT-2 and other techniques.)
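The overlap measurement can be sketched as follows; this is a simplified illustration with whitespace tokenization (the actual tokenization may differ).

```python
def offensive_overlap(docs, offensive_words):
    """Cardinality of the intersection between a document collection's
    vocabulary and an offensive/profane word list."""
    vocab = {word for doc in docs for word in doc.lower().split()}
    return len(vocab & set(offensive_words))

def relative_increase(seed_docs, augmented_docs, offensive_words):
    """Relative growth of the offensive-word intersection after
    augmentation (GPT-2 reached +260% in our experiments)."""
    base = offensive_overlap(seed_docs, offensive_words)
    new = offensive_overlap(augmented_docs, offensive_words)
    return (new - base) / base if base else float("inf")
```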

Mixed augmentations
In §4.2 we saw that the effects of augmentation differ across classifiers. A natural question is whether it is beneficial to combine augmentation techniques. For all classifiers except BERT, the best performing techniques were GPT-2, ADD, and BPEMB (Table 6). They also represent each of our augmentation types (§3.2), BPEMB having the highest performance among the four word replacement techniques (§3.2.1-§3.2.2) on these classifiers.
We combined the techniques by merging augmented documents in equal proportions. In ABG, we included documents generated by ADD, BPEMB or GPT-2. Since ADD and BPEMB impose significantly lower computational and memory requirements than GPT-2, and require no access to a GPU (Appendix C), we also evaluated combining only ADD and BPEMB (AB).
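Merging in equal proportions can be sketched as follows (function and variable names are ours):

```python
import random

def mix_equal(augmented_sets, total_docs, rng=random):
    """Merge synthetic documents from several techniques in equal
    proportions, e.g. ADD + BPEMB + GPT-2 for ABG.

    augmented_sets: dict mapping technique name -> list of synthetic docs.
    """
    share = total_docs // len(augmented_sets)
    mixed = []
    for name, docs in augmented_sets.items():
        mixed.extend(rng.sample(docs, share))  # equal share per technique
    return mixed
```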
ABG outperformed all other techniques (in F1-score) on Char-LR and CNN with statistical significance, while being marginally better on Word-LR. On BERT, ABG achieved a better F1-score and precision than GPT-2 alone (p < 10^-10), and better recall (p < 0.05). ABG was better than AB in recall on Word-LR and CNN, while precision was comparable.
Augmenting with ABG resulted in similar performance to GOLD STANDARD on Word-LR, Char-LR and CNN (Table 4). Comparing Tables 6 and 7, it is clear that much of the performance improvement came from the increased vocabulary coverage of GPT-2-generated documents. Our results suggest that for certain types of data, such as toxic language, consistent labeling may be more important than wide coverage in dataset collection, since automated data augmentation can increase the coverage of language. Furthermore, Char-LR trained with ABG was comparable (no statistically significant difference) to the best results obtained with BERT (trained with ADD, p > 0.2 on all metrics).

Table 7: Effects of mixed augmentation (20x) on SEED/threat (annotations as in Table 6). Precision and recall for threat; F1-score macro-averaged from both classes.

Average classification performance
The results in Tables 6 and 7 report precision, recall and the F1-score of different models and augmentation techniques with the probability threshold for the positive class fixed at 0.5. In practice, the preferred balance of precision and recall depends on the use case for the classifier.
Another general evaluation of a classifier is the ROC-AUC metric: the area under the curve of a plot of the true-positive rate versus the false-positive rate, for decision thresholds varying over [0, 1]. Table 8 shows the ROC-AUC scores for each of the classifiers for the best augmentation techniques from Tables 6 and 7. BERT with ABG gave the best ROC-AUC value of 0.977, which is significantly higher than BERT with any other augmentation technique (p < 10^-6). CNN exhibited a similar pattern: ABG resulted in the best ROC-AUC compared to the other augmentation techniques (p < 10^-6). For Word-LR, ROC-AUC was highest for ABG, but the difference to GPT-2 was not statistically significant (p > 0.05).
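For reference, ROC-AUC as computed with scikit-learn on a toy example:

```python
from sklearn.metrics import roc_auc_score

# ROC-AUC evaluates ranking quality over all decision thresholds,
# independent of the fixed 0.5 cut-off used for precision/recall/F1.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]   # predicted minority-class probabilities
auc = roc_auc_score(y_true, y_score)
```

Here three of the four positive/negative pairs are ranked correctly, giving an AUC of 0.75.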
In the case of Char-LR, none of the augmentation techniques improved on SEED (p < 0.05). Char-LR produced more consistent average performance across all augmentation methods, with ROC-AUC values varying between 0.958 and 0.973, compared to ranges of 0.792-0.962 and 0.816-0.977 for CNN and BERT, respectively.

Table 8: ROC-AUC scores for the best-performing augmentation techniques (annotations as in Table 6).
Our results highlight a difference between the results in Tables 6 and 7: while COPY reached a high F1-score on BERT, our results on ROC-AUC highlight that such performance may not hold while varying the decision threshold. We observe that a combined augmentation method such as ABG provides an increased ability to vary the decision threshold for the more complex classifiers such as CNN and BERT. Simpler models performed consistently across different augmentation techniques.

Computational requirements
BERT has significant computational requirements (Table 9). Deploying BERT on common EC2 instances requires 13 GB of GPU memory. ABG on EC2 requires 4 GB of GPU memory for approximately 100 seconds (for 20x augmentation). All other techniques take only a few seconds on ordinary desktop computers (see Appendices C-D for additional data on computational requirements).

Alternative toxic class
In order to see whether our results described so far generalize beyond threat, we repeated our experiments using another toxic language class, identity-hate, as the minority class. Our results for identity-hate are in line with those for threat. All classifiers performed poorly on SEED due to very low recall. Augmentation with simple techniques helped BERT gain more than 20 percentage points for the F1-score. Shallow classifiers approached BERT-like performance with appropriate augmentation. We present further details in Appendix B.

Discussion and conclusions
Our results highlight the relationship between classification performance and computational overhead. Overall, BERT performed the best with data augmentation. However, it is highly resource-intensive (§4.5). ABG yielded almost BERT-level F1- and ROC-AUC scores on all classifiers. While using GPT-2 is more expensive than other augmentation techniques, it has significantly lower requirements than BERT. Additionally, augmentation is a one-time upfront cost, in contrast to the ongoing inference costs of a classifier. Thus, the trade-off between performance and computational resources can influence which technique is optimal in a given setting.
We identify the following further topics that we leave for future work.

SEED coverage. Our results show that data augmentation can increase coverage, leading to better toxic language classifiers when starting with very small seed datasets. The effects of data augmentation will likely differ with larger seed datasets.

Languages. Some augmentation techniques are limited in their applicability across languages. GPT-2, WORDNET, PPDB and GLOVE are available for certain other languages, but with less coverage than in English. BPEMB is nominally available in 275 languages, but has not been thoroughly tested on less prominent languages. (Schuster et al., 2020). We leave such considerations for future work.

Acknowledgments

A Class overlap and interpretation of "toxicity"

Kaggle's toxic comment classification challenge dataset 9 contains six classes, one of which is called toxic. However, all six classes represent examples of toxic speech: toxic, severe toxic, obscene, threat, insult, and identity-hate. Of the threat documents in the full training dataset (GOLD STANDARD), 449/478 overlap with toxic. For identity-hate, the overlap with toxic is 1302/1405. Therefore, in this paper, we use the term toxic more generally, subsuming threat and identity-hate as particular types of toxic speech. To confirm that this was a reasonable choice, we manually examined the 29 threat datapoints not overlapping with toxic. All of these represent genuine threats, and are hence toxic in the general sense.

9 https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

B Alternative toxic class: identity-hate

To see if our results generalize beyond threat, we experimented on the identity-hate class in Kaggle's toxic comment classification dataset. Again, we used a 5% stratified sample of GOLD STANDARD as SEED. We first show the number of samples in GOLD STANDARD, SEED and TEST in Table 10. There are approximately 3 times more minority-class samples in identity-hate than in threat. Next, we show classifier performance on GOLD STANDARD/identity-hate in Table 11. The results closely resemble those on GOLD STANDARD/threat in Table 4 (§4.1).
We compared SEED and COPY with the techniques that had the highest performance on threat: ADD, BPEMB, GPT-2, and their combination ABG. Table 12 shows the results.
Like in threat, BERT performed the poorest on SEED, with the lowest recall (0.06). All techniques decreased precision from SEED, and all increased recall except COPY with CNN. With COPY, the F1-score increased with Char-LR (0.12) and BERT (0.21), but not Word-LR (0.01) or CNN (−0.04). This is in line with corresponding results from threat ( §4.2, Table 6): COPY did not help either of the word-based classifiers (Word-LR, CNN) but helped the character-and subword-based classifiers (Char-LR, BERT).
Of the individual augmentation techniques, ADD increased the F1-score the most with Char-LR (0.15) and BERT (0.20); and GPT-2 increased it the most with Word-LR (0.07) and CNN (0.07). Here again we see the similarity between the two word-based classifiers, and the two that take inputs below the word-level. Like in threat, COPY and ADD achieved close F1-scores with BERT, but with different relations between precision and recall. BPEMB was not the best technique with any classifier, but increased F1-score everywhere except in CNN, where precision dropped drastically.
In the combined ABG technique, Word-LR and CNN reached their highest F1-score increases (0.08 and 0.07, respectively). With Char-LR F1-score was also among the highest, but did not reach ADD. Like with threat, ABG increased precision and recall more than GPT-2 alone.
Overall, our results on identity-hate closely resemble those for threat, with more than 20 percentage point increases in the F1-score for BERT under COPY and ADD augmentation. Like in threat, the impact of most augmentations was greater on Char-LR than on Word-LR or CNN. Despite their similar F1-scores on SEED, Char-LR exhibited much higher precision, which decreased under augmentation but remained generally higher than with the other classifiers. Combined with an increase in recall to levels similar to or higher than those of the other classifiers, Char-LR reached BERT-level performance with proper data augmentation.

Table 12 (excerpt): the SEED (no oversampling) row.

Augmentation             Metric      Char-LR      Word-LR      CNN          BERT
SEED (no oversampling)   Precision   0.85 ± 0.04  0.59 ± 0.05  0.52 ± 0.08  0.65 ± 0.46
                         Recall      0.11 ± 0.04  0.12 ± 0.03  0.11 ± 0.04  0.06 ± 0.10
                         F1 (macro)  0.60 ± 0.03  0.60 ± 0.02  0.59 ± 0.02  0.54 ± 0.08

Table 12: Comparison of augmentation techniques for 20x augmentation on SEED/identity-hate: means for precision, recall and macro-averaged F1-score shown with standard deviations (10 repetitions). Precision and recall for identity-hate; F1-score macro-averaged from both classes.

C Implementation details

We provide library versions in Table 14. We used sklearn.metrics.precision_recall_fscore_support 10 for calculating minority-class precision, recall and the macro-averaged F1-score; for the first two, we applied pos_label=1, and set average='macro' for the third. For ROC-AUC, we used sklearn.metrics.roc_auc_score 11 with default parameters. For t-tests, we used scipy.stats.ttest_rel. 12

10 https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
11 https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
12 https://docs.scipy.org/doc/scipy/

E Inflection corpus

…lease), 20 and WordNet example sentences. 21 The rationale for the corpus was to have a large vocabulary along with relatively simple grammatical structures, to maximize both coverage and the correctness of POS-tagging. We mapped each lemma-POS pair to its most common inflected form in the corpus. When performing synonym replacement in WORDNET augmentation, we lemmatized and POS-tagged the original word with NLTK, chose a random synonym for it, and then inflected the synonym with the original POS-tag if it was present in the inflection dictionary.

F GPT-2 fine-tuning

In §4.2-§4.4, we generated novel documents with GPT-2 fine-tuned on threat documents in SEED for 2 epochs. In Table 17, we show the impact of changing the number of fine-tuning epochs for GPT-2. Precision generally increased as the number of epochs was increased; however, recall simultaneously decreased.

G Ablation study
In §4.2-§4.4, we investigated several word replacement techniques with a fixed change rate. In those experiments, we allowed 25% of possible replacements. Here we study each augmentation technique's sensitivity to the replacement rate. As in previous experiments, we ensured that at least one augmentation was always performed. Results are shown in Tables 18-21.
Interestingly, all word replacements decreased classification performance with BERT. We suspect this occurred because of the pre-trained weights in BERT.
We show threat precision, recall and macro-averaged F1-scores for PPDB in Table 18. Changing the substitution rate had very little impact on performance for any classifier. This indicates that there were very few n-gram candidates that could be replaced. We show results for WORDNET in Table 19 and for GLOVE in Table 20. Word-LR performed better with higher substitution rates (increased recall). Interestingly, Char-LR performance (particularly precision) dropped with GLOVE compared to using COPY. For CNN, smaller substitution rates seem preferable, since precision decreased quickly as the number of substitutions increased.
BPEMB results in Table 21 are consistent across Char-LR, Word-LR and CNN. Substitution rates in the range 12%-37% increased recall over COPY. However, precision dropped at different points depending on the classifier; CNN precision dropped earlier than for the other classifiers, already at a 25% change rate.

H Augmented threat examples
We provide examples of augmented documents in Table 22. We picked a one-sentence document as the seed. We remark that augmented documents created by GPT-2 have the highest novelty, but may not always be considered threats (see example GPT-2 #1 in Table 22).

Table 17: Impact of the number of GPT-2 fine-tuning epochs on classifier performance (rows: classifier, metric; columns: fine-tuning epochs).

Table 19: Impact of changing the proportion of substituted words on WORDNET-augmented datasets. Mean results for 10 repetitions. Classifier's highest numbers highlighted in bold.