Fine-tuning BERT for multi-domain and multi-label incivil language detection

Incivility is a problem on social media, and it comes in many forms (name-calling, vulgarity, threats, etc.) and domains (microblog posts, online news comments, Wikipedia edits, etc.). Machine learning models trained to detect such incivility must handle the multi-label and multi-domain nature of the problem. We present a BERT-based model for incivility detection and propose several approaches for training it on multi-label and multi-domain datasets. We find that individual binary classifiers outperform a joint multi-label classifier, and that simply combining multiple domains of training data outperforms other recently-proposed fine-tuning strategies. We also establish new state-of-the-art performance on several incivility detection datasets.


Introduction
In 2019, 93% of Americans identified incivility as a problem, with 68% classifying it as a "major" problem, and those who experienced incivility faced on average 10.2 uncivil interactions each week (Weber Shandwick et al., 2019). Of those who expected civility to get worse, "social media/the Internet" topped the list of what they blamed, above "the White House", "politicians in general", "the news media", etc. Especially on social media and the Internet, this incivility often takes the form of uncivil language: features of discussion that convey an unnecessarily disrespectful tone toward the discussion forum, its participants, or its topics (Coe et al., 2014).
Uncivil language can range from name-calling (e.g., Mark, you're some kind of special stupid) to vulgarity (e.g., Just build the damn mine already!) to threats (e.g., Fine. I will destroy you.) and beyond. Different types of incivilities often appear in the same utterance (e.g., name-calling, vulgarity, and threats are all included in SHUT UP, YOU FAT POOP, OR I WILL KICK YOUR ASS!!!). Uncivil language appears in many places online, from microblogs like Twitter, to comments on online newspapers, to edit histories of resources like Wikipedia.
Uncivil language detection is thus a multi-label and multi-domain language processing problem. While there has been much research in natural language processing methods for identifying such incivility, especially in the subarea of abusive language (Wiegand et al., 2019; Zampieri et al., 2019; Basile et al., 2019; Sadeque et al., 2019; van Aken et al., 2018, etc.), the multi-label and multi-domain nature of incivility detection is understudied. We thus consider incivility detection on several datasets that (1) require the classification of incivility into several not-mutually-exclusive fine-grained categories, and (2) cover multiple genres of online interactions. Our contributions are:
• We achieved a new state-of-the-art on both the Coe et al. (2014) and Conversation AI (2018) datasets using BERT (Devlin et al., 2019).
• We compared several algorithms for training classifiers across the multiple domains in these datasets and showed that combining the training data from all domains outperforms other recently-proposed fine-tuning strategies.
• We compared several approaches for handling the multi-label nature of these datasets and showed that independent binary classifiers outperform jointly-trained models.

Task
We frame uncivil language detection as a multi-label text classification problem, where the input is a piece of text, and the outputs are the types of incivilities (name-calling, vulgarity, etc.) that are present. Formally, we aim to learn a function $h$ such that for each piece of text $x$:
$$h(\mathrm{repr}(x)) = y$$
where $\mathrm{repr}(x)$ is a tensor representing that text (e.g., a series of word vectors), and $y$ is a binary vector where $y_i$ is 1 if $x$ contains the $i$-th form of incivility and 0 otherwise.
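As a small illustration of the target vector y, the sketch below encodes a set of incivility labels as a binary vector. The label order and the helper name `encode_labels` are assumptions made for this example and are not taken from the original setup.

```python
# Hypothetical label inventory; the order is an arbitrary choice for illustration.
LABELS = ["aspersion", "lying accusation", "name-calling", "pejorative", "vulgarity"]


def encode_labels(present, labels=LABELS):
    """Return the binary vector y, where y[i] is 1 if the i-th label is present."""
    unknown = set(present) - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if label in present else 0 for label in labels]


# A text annotated with both name-calling and vulgarity.
y = encode_labels({"name-calling", "vulgarity"})
print(y)  # [0, 0, 1, 0, 1]
```

Because the labels are not mutually exclusive, any number of positions in y can be 1, including none.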
We frame learning such $h$ functions as a multi-domain classifier training problem, where training and testing data are drawn from multiple domains (news comments, politician tweets, etc.). Formally, given a domain $D_i$, we aim to learn a function $h_{D_i}$ that maximizes performance on test data $D_i^{\mathrm{test}}$ by training on examples $(x, y)$ drawn from training data $D_1^{\mathrm{train}} \cup D_2^{\mathrm{train}} \cup \ldots \cup D_n^{\mathrm{train}}$.

Data
We consider the following datasets for evaluating multi-label and multi-domain incivility detection.
Local news comments
In this multi-label dataset, Coe et al. (2014) define the following labels and use them to annotate online comments on local news articles:
• aspersion: "Mean-spirited or disparaging words directed at an idea, plan, policy, or behavior."
• lying accusation: "Stating or implying that an idea, plan, or policy was disingenuous."
• name-calling: "Mean-spirited or disparaging words directed at a person or group of people."
• pejorative: "Disparaging remark about the way in which a person communicates."
• vulgarity: "Using profanity or language that would not be considered proper (e.g., pissed, screw) in professional discourse."

Local politics Tweets
Coe and colleagues also annotated a collection of microblog posts from the Twitter accounts of their local politicians, but only for name-calling incivility.

Russian troll Tweets
Coe and colleagues also annotated a small subset of the 3 million English Tweets written by Russian trolls and collected by Linvill and Warren (2018), again for just name-calling incivility.

Wikipedia comments
In this multi-label dataset, also known as the Kaggle Toxic Comment Classification Challenge, Jigsaw/Google's Conversation AI team annotated comments from Wikipedia's talk page edits (Conversation AI, 2018) for the presence of the following types of abusive language, defined by Perspective AI (2020):
• toxic: "A rude, disrespectful, or unreasonable comment that is likely to make people leave a discussion."
• severe-toxic: "A very hateful, aggressive, disrespectful comment or otherwise very likely to make a user leave a discussion or give up on sharing their perspective. This attribute is much less sensitive to more mild forms of toxicity, such as comments that include positive uses of curse words."
• obscene: "Swear words, curse words, or other obscene or profane language."
• threat: "Describes an intention to inflict pain, injury, or violence against an individual or group."
• insult: "Insulting, inflammatory, or negative comment towards a person or a group of people."
• identity-hate: "Negative or hateful comments targeting someone because of their identity."
Table 1 shows statistics for the different datasets.
The three datasets annotated by Coe and colleagues can be used in multi-domain experiments, as they share the same annotation scheme. They share only the label name-calling, so our multi-domain experiments consider only binary classification. The local news comments and Wikipedia comments datasets can be used in multi-label experiments, as they have been annotated for multiple forms of incivility. They do not share annotation schemes, so our multi-label experiments consider each multi-label dataset separately.

Prior Work
There is much recent work on detecting incivility (also referred to as toxicity, abusive language, offensive language, etc.) in social media. Wiegand et al. (2019) present an overview of such efforts and show that many datasets constructed for this purpose have unintended bias because of how they have been sampled. We focus on the Coe et al. (2014) and Conversation AI (2018) datasets because they do not have the problems with topic-biased sampling that some other datasets do, where topic words are better predictors of incivility than uncivil words. There have also been several recent shared tasks that consider incivility. Both the OffensEval shared task (Zampieri et al., 2019) and the HatEval shared task (Basile et al., 2019) ran as part of SemEval-2019 and considered detection of various forms of offensive and hate speech. Neither of these tasks focused on a multi-label or multi-domain problem.
A few models have been designed for and evaluated on the multi-label, multi-domain corpora we consider. Sadeque et al. (2019) considered the local news comments corpus, training recurrent neural network models and focusing on only the two most frequent labels for this dataset. They achieved 0.48 F1 for name-calling and 0.53 F1 for vulgarity. van Aken et al. (2018) presented multiple approaches to the Wikipedia comments dataset. They developed an ensemble of logistic regression, recurrent neural networks, and convolutional neural networks, achieving an AUC score of 0.983.
There are a few recent works in cross-domain abusive language detection. Wiegand et al. (2018), Karan and Šnajder (2018), and Pamungkas and Patti (2019) all explore training models on one abusive language dataset and testing on another. They focus on binary predictions and bag-of-words support vector machine classifiers (though Pamungkas and Patti (2019) also explore a recurrent neural network). They do not consider multi-label problems, or modern pre-trained neural networks like BERT, which were more successful in recent shared tasks on abusive language (Zampieri et al., 2019). They also evaluate on several datasets that have been identified as problematic by Wiegand et al. (2019) due to their use of topic-biased sampling.

Experiments
We use BERT (Devlin et al., 2019) as the starting point for all experiments. BERT is a pre-trained transformer-based neural network that has shown impressive performance on a wide variety of NLP tasks. We follow the standard approach for fine-tuning BERT for text classification, placing a fully connected layer over BERT's [CLS] output. We use n sigmoids on this layer rather than a softmax activation, since we are performing multi-label classification. BERT is then fine-tuned as usual, with hyperparameters such as learning rate, maximum sequence length, number of epochs, and training batch size tuned on the development set. We explored each hyperparameter within the following ranges:
• learning rate: 8e-6, 2e-5, 4e-5, 8e-5
• maximum sequence length: 128, 256, 512
• number of epochs: 2, 3, 4, 5, 6, 8
• training batch size: 16, 32, 64, 128
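To give a sense of the size of this search space, the sketch below enumerates every combination of the listed values. The text does not say whether the search was exhaustive, so treating it as a full grid is an assumption; the dictionary keys and the helper `configurations` are hypothetical names chosen for this example.

```python
from itertools import product

# Search space as listed above; the key names are illustrative.
GRID = {
    "learning_rate": [8e-6, 2e-5, 4e-5, 8e-5],
    "max_seq_length": [128, 256, 512],
    "num_epochs": [2, 3, 4, 5, 6, 8],
    "batch_size": [16, 32, 64, 128],
}


def configurations(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


configs = list(configurations(GRID))
print(len(configs))  # 288
```

Each configuration would be scored on the development set, and the best one kept for evaluation on the test set.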

Multi-domain models
We consider three methods for training classifiers for prediction in multiple domains:
• Single: One classifier is fine-tuned for each domain.
• Joint: One classifier is fine-tuned on the combined training data from all the domains.
• Joint→Single: First, a joint classifier is fine-tuned. Then, the joint classifier parameters are used to initialize n individual classifiers, one for each domain. This approach is inspired by Liu et al. (2019a), who found that, for some natural language understanding problems, multi-task fine-tuning followed by individual task fine-tuning outperformed multi-task fine-tuning alone.
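The three training regimes described above differ only in which weights each fine-tuning stage starts from and which data it sees. The sketch below makes that explicit as a plan of stages per target domain. It is a schematic under stated assumptions, not the authors' code: `training_plan` and the checkpoint names "bert" and "joint" are hypothetical, and no model is actually trained.

```python
def training_plan(strategy, domains):
    """Map each target domain to its list of fine-tuning stages.

    Each stage is (initial_weights, training_domains). "bert" stands for the
    pre-trained checkpoint and "joint" for the jointly fine-tuned classifier.
    """
    if strategy == "single":
        # One classifier per domain, trained on that domain only.
        return {d: [("bert", [d])] for d in domains}
    if strategy == "joint":
        # One shared classifier trained on the pooled data from all domains.
        return {d: [("bert", list(domains))] for d in domains}
    if strategy == "joint->single":
        # Joint fine-tuning first, then a further stage on the target domain.
        return {d: [("bert", list(domains)), ("joint", [d])] for d in domains}
    raise ValueError(f"unknown strategy: {strategy}")


DOMAINS = ["local news comments", "local politics Tweets", "Russian troll Tweets"]
for strategy in ("single", "joint", "joint->single"):
    print(strategy, training_plan(strategy, DOMAINS)["local news comments"])
```

Under Joint, every domain is served by the same classifier, so the plan is identical for all targets; only Joint→Single adds a second, domain-specific stage.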
Since our multi-domain datasets share only the label name-calling, we train our multi-domain classifiers only for binary classification (i.e., they are not also multi-label).
Table 2 shows the results of these experiments. The first three rows compare the different training procedures on the development sets. We find that simply combining all the data achieves the best F1 for both the local news comments and Russian troll Tweets data, and similar F1 to the more complicated Joint→Single procedure on the remaining dataset. When we evaluate this best model on the test data, we achieve a new state-of-the-art on the local news comments corpus, 0.56 F1. We are the first to report results on the local politics Tweets and Russian troll Tweets domains, as Sadeque et al. (2019) did not evaluate on these.
These results did not replicate the findings of Liu et al. (2019a) when applied to our incivility datasets; the extra fine-tuning for each domain was unhelpful, and simply combining all the data was the best. This probably argues for exploring other approaches for domain adaptation, e.g., Kim et al. (2016), but it may also simply suggest that Coe et al. (2014)'s annotators were consistent across datasets, making it easy for BERT to learn the core linguistic phenomenon despite differences in domains.

Multi-label models
Similar to our approach for multi-domain models, we consider three methods for training classifiers for multi-label prediction:
• Single: One binary classifier is fine-tuned for each label. The output layer of the model is a single sigmoid unit.
• Joint: One joint classifier is fine-tuned for all labels. The output layer of the model is n sigmoid units, one for each label.
• Joint→Single: First, a joint classifier is fine-tuned. Then, the joint classifier parameters are used to initialize n binary classifiers, one for each label. This is again inspired by the multi-task training procedure of Liu et al. (2019a).
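The difference between the Single and Joint output layers can be sketched in a few lines. The example below assumes a binary cross-entropy loss and a 0.5 decision threshold, both common choices for sigmoid outputs that the text does not state explicitly; the function names and the logits are made up for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def bce(prob, target):
    """Binary cross-entropy for one label (assumed loss, not stated in the text)."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))


def joint_loss(logits, targets):
    """Joint head: n sigmoid units, with the loss averaged over the n labels."""
    return sum(bce(sigmoid(z), t) for z, t in zip(logits, targets)) / len(logits)


def single_loss(logit, target):
    """Single head: one sigmoid unit for one label."""
    return bce(sigmoid(logit), target)


logits = [2.0, -1.0, 0.0]  # made-up scores for three labels
targets = [1, 0, 1]
probs = [sigmoid(z) for z in logits]
print([p >= 0.5 for p in probs])  # [True, False, True]
print(joint_loss(logits, targets))
```

Unlike a softmax, the sigmoid units are scored independently, so several labels can be predicted for the same text.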
Since our multi-label datasets do not share an annotation scheme, we train the multi-label classifiers on only one dataset at a time (i.e., they are not also multi-domain). Table 3 shows the results of these experiments. We find that in most cases training individual binary classifiers (Single) is better than a jointly-learned multi-label classifier (Joint). This is somewhat surprising as the latter is the standard approach with neural networks (Adhikari et al., 2019).
To test whether low-frequency classes were the problem, we tried training a multi-label model on just the three most frequent classes of the Wikipedia comments dataset: toxic, obscene, and insult (Joint top-3 classes). That slightly improved performance on those three classes, but of course at the cost of ignoring the remaining classes. Adding the staged training procedure (Joint→Single) on top of this classifier only decreased performance. This suggests that class imbalance may be part of the problem, but is not the full explanation.
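Restricting a multi-label dataset to its most frequent classes, as in the Joint top-3 setting, amounts to counting label occurrences and dropping the rest. The sketch below shows one way to do this; the helpers `top_k_labels` and `restrict` are hypothetical, and the toy examples are invented rather than drawn from the Wikipedia comments data.

```python
from collections import Counter


def top_k_labels(examples, k):
    """Return the k labels that occur most often across the examples."""
    counts = Counter(label for _, labels in examples for label in labels)
    return [label for label, _ in counts.most_common(k)]


def restrict(examples, keep):
    """Drop every label that is not in `keep`; the texts are left unchanged."""
    keep = set(keep)
    return [(text, labels & keep) for text, labels in examples]


# Toy examples with made-up label sets.
examples = [
    ("comment a", {"toxic", "obscene", "insult"}),
    ("comment b", {"toxic", "threat"}),
    ("comment c", {"toxic", "obscene"}),
    ("comment d", {"toxic", "insult"}),
    ("comment e", set()),
]
kept = top_k_labels(examples, 3)
print(sorted(kept))  # ['insult', 'obscene', 'toxic']
print(restrict(examples, kept)[1])  # ('comment b', {'toxic'})
```

Examples that carried only a dropped label, such as the threat in "comment b", stay in the training data but lose that annotation.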
Note that we are the first to report all individual label F1 scores on both datasets. In the case of the local news comments data, this is because Sadeque et al. (2019), noting the class imbalance problem, decided to only train and evaluate on two classes. In the case of the Wikipedia comments data, this is because the official evaluation metric is AUC, so most systems focused on optimizing this measure. However, as Table 3 shows, while we achieve a state-of-the-art AUC, AUC is not a very discriminative measure for this dataset. For example, both the Single model that predicts all six classes and the Joint top-3 classes model that doesn't even try to predict severe-toxic, threat, or identity-hate achieve the same AUC of 0.990. The F1 scores more clearly show that the Joint top-3 classes model is as good or better for all labels but insult.
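For reference, a per-label F1 score is computed from the binary decisions for that label alone, which is why it exposes labels a model never predicts. The sketch below is a plain implementation of the standard definition; the function name and the toy gold and predicted vectors are made up for illustration.

```python
def f1_score(gold, predicted):
    """F1 for one binary label, given parallel lists of 0/1 values."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = [1, 1, 0, 0, 1]
pred = [1, 0, 0, 1, 1]
print(f1_score(gold, pred))

# A model that never predicts the label scores 0 on it.
print(f1_score(gold, [0, 0, 0, 0, 0]))  # 0.0
```

In the first call there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3.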

Limitations
We focused on a BERT-based model due to its top-ranking performance in related shared tasks (Zampieri et al., 2019), but recent advances over BERT, e.g., RoBERTa (Liu et al., 2019b), might yield additional gains. We also focused on the limited number of datasets that could support multi-label and/or multi-domain experiments, but our results could be strengthened by creating new multi-label, multi-domain datasets. Finally, class imbalance only partly explains why a joint multi-label classifier failed to outperform independent binary classifiers, indicating that further investigation is needed into multi-label classification approaches for uncivil language.

Conclusion
We applied BERT to multi-label and multi-domain incivility detection tasks and achieved a new state-of-the-art on several different datasets. In exploring different training procedures, we found that it was better to directly combine data from multiple domains than to use other, more complex procedures, and that it was better to train individual binary classifiers than to train a joint multi-label classifier.