Developing a New Classifier for Automated Identification of Incivility in Social Media

Incivility is not only prevalent on online social media platforms, but also has concrete effects on individual users, online groups, and the platforms themselves. Given the prevalence and effects of online incivility, and the challenges involved in human-based incivility detection, it is urgent to develop validated and versatile automatic approaches to identifying uncivil posts and comments. This project advances both a neural, BERT-based classifier and a logistic regression classifier to identify uncivil comments. The classifiers are trained on a dataset of Reddit posts, which are annotated for incivility, and further expanded using a combination of labeled data from Reddit and Twitter. Our best performing model achieves an F1 of 0.802 on our Reddit test set. The final model is not only applicable across social media platforms and their distinct data structures, but also computationally versatile and, as such, ready to be used on vast volumes of online data. All trained models and annotated data are made available to the research community.


Introduction
Given the growing polarization in the United States, the increasing popularity of partisan media, and the widespread use of social media for information and discussion (see Iyengar et al. (2019) for a review), many scholars and observers worry about the accelerated use and spread of incivility in the online environment. Incivility, defined as "features of discussion that convey disrespectful tone toward the discussion forum, its participants, or its topics" (Coe et al., 2014), is a common aspect of many online communities, especially anonymous forums (Reader, 2012) such as Reddit. Estimates suggest that more than 84% of Americans have experienced incivility online, and among those who have ever experienced it, the average number of weekly encounters with incivility was as high as 10.6 (KRC Research, 2018). In addition to lowering the standards of public discourse, incivility has concrete effects on users, online discussions, and social media platforms. The use of and exposure to incivility generates negative emotions, such as anger, anxiety, or mental distress, and is related to aggression (Gervais, 2015) and hostile communication (Groshek and Cutino, 2016). Incivility also turns users away from online discussions altogether (Anderson et al., 2014; Bauman et al., 2013; Moor et al., 2010; Ransbotham et al., 2016). Given these reasons for the public and industry to be concerned with online incivility, many companies seek to automatically detect incivility in order to understand its scope, identify the online communities in which incivility is particularly prevalent, and, ultimately, address the problem.
This project offers a step in this direction. We present machine learning models for detecting incivility in social media, models that are not only computationally efficient but also applicable across platforms. We propose both a BERT-based neural classifier and a logistic regression-based classifier trained on manually annotated and artificially labeled data. Our results suggest that the proposed models perform well across the distinct data and communication structures of different platforms and, as such, can be easily applied to detect incivility.

Previous Work
There is considerable conceptual and operational ambiguity in the literature on incivility and related concepts under the umbrella of offensive or intolerant speech (see Rossini (2020) for a review). Some studies use incivility interchangeably with hate speech, which refers to speech that aims to discriminate against a certain identity group, or aggressive or toxic language, which includes personal attacks (Rösner and Krämer, 2016). However, incivility is a broader concept, which focuses on content that goes against acceptable social norms in terms of vulgarity, name-calling, or offensive language (Papacharissi, 2004), whereas hate speech or aggressive language captures more specifically discourse that offends, derogates, or silences others and may promote harm (Rossini, 2020). Increasingly, incivility is conceptually and operationally distinguished from such intolerant discourse, and evidence suggests that the effects of these two forms of expression also differ (Rossini, 2020). Definitions of incivility vary, ranging from "a norm-defying behavior" (Gervais, 2015) and "an explicit attack" (Anderson and Huntington, 2017) to the violation of interpersonal politeness norms (Mutz, 2015; Mutz and Reeves, 2005), yet most include a lack of respect toward discussion participants or arguments (Santana, 2014) and an impolite tone of discourse (Papacharissi, 2004). The often-used definition, which we adopt for the purpose of our machine learning model, sees incivility as features of discussion that convey disrespectful tone toward the discussion participants or its topics, including name-calling, mean-spirited or disparaging words directed at a person or group of people, an idea, plan, policy, or behavior, vulgarity, using profanity or language that would not be considered proper in professional discourse, and pejorative remarks about the way in which a person communicates (Coe et al., 2014).
As such, our approach encompasses both the less societally detrimental foul language or harsh tone and the more intolerant discourse.
From a technical perspective, previous research using machine learning models to detect incivility and other offensive or intolerant language online has focused primarily on the use of logistic regression (Theocharis et al., 2020; Daxenberger et al., 2018; Maity et al., 2018), support vector machines (Joksimovic et al., 2019; Maity et al., 2018), and various neural classification models (Sadeque et al., 2019). BERT (Devlin et al., 2019) and related transformer language models have been used in related tasks, such as identifying abusive language on Twitter (Nikolov and Radivchev, 2019; Risch et al., 2019), including by many entrants in the OffensEval task at SemEval-2020 (Zampieri et al., 2020). To our knowledge, this paper is the first to utilize a fine-tuned BERT model to identify incivility on social media platforms, and one of few projects that train the classifier on data from more than one platform. Also, past work on identifying incivility over time has mostly analyzed Twitter data during certain political events, such as the 2016 presidential election in the US (Siegel et al., 2018), and/or looked at political incivility in specific contexts (e.g., among politicians; Theocharis et al., 2020). These rather narrow, single-platform foci limit the applicability of the developed classifiers, a limitation we address in this project.
In addition to these contributions of our work, our primary contribution may lie in our data augmentation method. Specifically, we extend recent approaches to automatically label additional training data to improve the performance of a logistic regression classifier. Previous work in detection of offensive language has used back-translation (Ibrahim et al., 2020) and data transformation techniques (Rizos et al., 2019) to augment limited training data. While some work (Theocharis et al., 2020) utilizes Google's Perspective API to label additional training data, which introduces noise to the operationalization of incivility, we take advantage of our well-performing BERT classification model to generate artificial training data for a logistic regression classifier. The resulting classifier can be efficiently run on CPU and is far less computationally expensive than our comparably performing BERT model. This extension makes our classifier easily applicable to vast amounts of data and readily implemented on social media platforms or the comments sections of websites of news media organizations.
Reddit is organized into communities, known as subreddits. Each subreddit has a general topic, behavioral norms, and community standards, allowing for the creation of a diverse dataset, which further increases the applicability of the resulting machine learning model.
To tackle the detection problem, we identified the most popular subreddits from 2006 to 2019 that contained 95% of the total comments by (1) the number of comments in the subreddit each year, and (2) the number of followers that commented in the subreddit each year, which resulted in 9355 subreddits across the years. We then collected 5000 comments from these subreddits using a stratified random sampling technique, such that the number of comments sampled from each year is proportional to that year's share of the total number of comments. These 5000 comments were then manually labeled.
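The proportional allocation described above can be illustrated with a minimal sketch. The function name `stratified_sample`, the largest-remainder rounding rule, and the toy corpus sizes are assumptions made for illustration; they are not taken from the project's scripts.

```python
import random

def stratified_sample(comments_by_year, total, seed=0):
    """Draw `total` comments, giving each year a quota proportional to its size."""
    rng = random.Random(seed)
    grand_total = sum(len(c) for c in comments_by_year.values())
    # Exact (fractional) proportional share of the sample for each year.
    exact = {y: total * len(c) / grand_total for y, c in comments_by_year.items()}
    # Round down, then hand leftover slots to the largest fractional remainders
    # (an assumed rounding rule; the paper does not specify one).
    quota = {y: int(x) for y, x in exact.items()}
    leftover = total - sum(quota.values())
    by_remainder = sorted(exact, key=lambda y: exact[y] - quota[y], reverse=True)
    for y in by_remainder[:leftover]:
        quota[y] += 1
    # Simple random sampling without replacement inside each year (stratum).
    sample = []
    for y, comments in comments_by_year.items():
        sample.extend(rng.sample(comments, quota[y]))
    return sample, quota

# Hypothetical toy corpus: three years with 100, 300 and 600 comments.
toy = {
    2017: [f"2017-{i}" for i in range(100)],
    2018: [f"2018-{i}" for i in range(300)],
    2019: [f"2019-{i}" for i in range(600)],
}
sample, quota = stratified_sample(toy, total=50)
```

In this toy setting the years receive 10%, 30%, and 60% of the sampled comments, mirroring their shares of the corpus.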

Dataset Annotation
Instead of adapting annotation schema that focused on profanity and swear words or phrases (i.e., the narrower definition of incivility) (Zampieri et al., 2019; Mohan et al., 2017; Almerekhi et al., 2020), we developed a coding manual to classify comments according to four dimensions present in offensive speech more broadly. We account for whether a comment contains: (1) name-calling, mean-spirited or disparaging words directed at a person or a group of people; (2) aspersion, mean-spirited or disparaging words directed at an idea, plan, policy or behavior; (3) pejorative or disparaging remarks about the way in which a person communicates; and (4) vulgarity, profanity or language that would not be considered proper. Our operational approach accounted for the content aspect (e.g., vulgarity or profanity, such as "you're a dumbass for simplifying the issue and trying to jump right into the helm of the 'y'r all hypocrites' bandwagon") and the different targets of incivility or foul content included in the intolerant discourse (e.g., "... the interests of left-handed black female dwarves"), to create a comprehensive and inclusive annotated dataset for model building. Annotators were asked to apply a binary label to indicate whether or not the comment contains incivility. The annotators were three undergraduate students in social sciences at UC Davis, two native English speakers and one with English as a second language. Two annotators are heavy Reddit users and one is a user of other social media. The annotators were trained on the definitions and procedures, and each of them completed five pilot coding exercises. Each annotator first independently coded a random set of 50 comments, yielding a Fleiss's kappa of 0.618. They then compared results, discussed and resolved discrepancies, and clarified confusions. These steps were repeated multiple times with increasingly large comment sets until an acceptable agreement level was reached.
In total, all three annotators completed 1000 comments together during training, with Fleiss's kappa of 0.663. The major discrepancies pertain to potentially sarcastic comments (e.g., "Great, now we're paying for CBC to promote cuckoldry"), which some coders saw as uncivil and others as innocent sarcasm. After an acceptable coding precision was established among the three annotators, the remaining 4000 comments were randomly divided into three sets and each annotator independently coded an assigned set. The final result of this process is a set of 5000 comments labeled for incivility. Additionally, our dataset includes coding at the subreddit level to identify subreddits that were political, non-political, or mixed (i.e., contained some political and some non-political content). This allows us to analyze the prevalence of incivility across different kinds of online discussions and across the political spectrum.
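For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of Fleiss's kappa for a fixed number of annotators per item. The function name and the toy ratings are illustrative assumptions; the values 0.618 and 0.663 come from the real annotation rounds and are not reproduced here.

```python
def fleiss_kappa(ratings, categories=(0, 1)):
    """Fleiss's kappa for `ratings`, a list of per-item label lists.

    Every item must be labeled by the same number of annotators.
    Undefined (division by zero) if all labels fall in a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of annotators who assigned category j to item i.
    counts = [[item.count(c) for c in categories] for item in ratings]
    # Observed agreement for each item, then averaged over items.
    p_i = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Chance agreement from the overall category proportions.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical toy data: 4 comments, 3 annotators, 1 = uncivil, 0 = civil.
toy_ratings = [[1, 1, 1], [0, 0, 0], [0, 0, 1], [0, 0, 0]]
kappa = fleiss_kappa(toy_ratings)
```

A single split decision among four comments already lowers kappa well below 1, which helps explain why sarcastic borderline cases weigh heavily on the reported agreement.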

Classifier Training
To demonstrate the efficacy of our collected dataset, we use supervised machine learning to automatically identify uncivil Reddit posts. However, annotating a dataset large enough to train a state-of-the-art neural classifier from scratch is a costly and time-consuming undertaking. We experimented with several neural binary classifiers, with our best-performing models built on top of transformer-based language models, namely BERT (Devlin et al., 2019) and its relative, DistilBERT (Sanh et al., 2019). Past work has demonstrated that fine-tuning large, pre-trained language models, such as BERT and DistilBERT, is an effective method for creating a high-quality neural classifier with limited supervised training data. As described in Sun et al. (2019), we conduct additional pretraining of the BERT-base and DistilBERT-base models on a large collection of Reddit posts as in-domain data. Once pretrained, we fine-tune our models for classification on our annotated dataset of Reddit comments, which trained annotators classified with binary labels for incivility.
Finally, in an effort to extend past work by creating a more flexible, platform agnostic classifier, we train a logistic regression classifier for incivility prediction in social media by combining the data presented in Theocharis et al. (2020) with our annotated and artificially labeled datasets.
Our Reddit dataset (including annotation disagreements), test predictions, scripts and models are available on the project GitHub repository.

Experiments and Results
Our BERT and DistilBERT models begin with the respective base pretrained language models, as implemented in HuggingFace's Transformers package. We then further pretrain these models on a dataset of 3 million Reddit posts for 100,000 training steps (as suggested by Sun et al. (2019)) using the masked word prediction task (Devlin et al., 2019). We then use these pretrained models in a classification setup, with a softmax layer predicting binary class probabilities from the [CLS] token in BERT's final hidden layer. For classification fine-tuning, all inputs to the models are limited to 256 tokens in length, with a training batch size of 16. We use the AdamW optimizer (Gugger and Howard, 2018) with default learning rate and epsilon values. We fine-tune our model for classification for four epochs on our dataset of 5,000 Reddit posts, which are coded for incivility, with 10% of the data set aside for training validation and 1000 annotated posts set aside for model testing. Classification results using BERT and DistilBERT are shown in Table 1.
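To make the classification setup concrete, the sketch below shows only the final step: a linear projection of the [CLS] vector to two logits followed by a softmax. It is a plain-Python illustration under stated assumptions, not the project's implementation: the 4-dimensional vector stands in for BERT-base's 768-dimensional hidden state, and the weights are untrained toy values.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_cls(cls_vector, weights, bias):
    """Project the [CLS] vector to one logit per class, then normalize."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

# Hypothetical toy values; a real [CLS] vector comes from the encoder.
cls_vector = [0.2, -1.0, 0.5, 0.3]
weights = [
    [0.1, 0.4, -0.2, 0.0],    # row for class 0 (civil)
    [-0.3, -0.5, 0.6, 0.2],   # row for class 1 (uncivil)
]
bias = [0.0, 0.0]
probs = classify_cls(cls_vector, weights, bias)
predicted_label = probs.index(max(probs))
```

During fine-tuning, the projection weights and the encoder parameters are updated jointly, so the [CLS] representation itself adapts to the incivility labels.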

One major goal of this project is to classify multiple years of Reddit data for further analysis of incivility across political and non-political subreddits. Despite the acceptable performance of our BERT classification models, the models were too computationally expensive to classify the approximately 800 million posts per year we collected from Reddit. To address this constraint, we also train a logistic regression classification model to be able to classify large amounts of Reddit data without the use of expensive neural classifiers. However, given the small size of our annotated training set, we must generate additional training data to train an effective logistic regression model. In order to improve system performance, we first use our fine-tuned DistilBERT model to classify a large collection of Reddit posts. We then uptrain a logistic regression model on this synthetic data, along with our annotated data. As detailed in Section 5, the resulting model achieves an F1 score that is competitive with our BERT and DistilBERT models, while also being able to classify data more quickly and at lower computational cost, making our model widely applicable.
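The uptraining step can be summarized as a teacher-student data pipeline: a stronger model labels unlabeled posts, and those synthetic labels are pooled with the human annotations to train a cheaper model. The sketch below illustrates only the data-assembly part; `toy_teacher` is a hypothetical keyword rule standing in for the fine-tuned DistilBERT model, and all example texts are invented.

```python
def build_uptraining_set(gold, unlabeled, teacher):
    """Pool human-labeled pairs with teacher-labeled (synthetic) pairs.

    gold:      list of (text, label) pairs from human annotators
    unlabeled: list of raw texts
    teacher:   callable mapping a text to a 0/1 incivility label
    """
    synthetic = [(text, teacher(text)) for text in unlabeled]
    return gold + synthetic

def toy_teacher(text):
    # Hypothetical stand-in for the neural classifier: a single keyword rule.
    return int("idiot" in text.lower())

gold = [("you are an idiot", 1), ("thanks for sharing", 0)]
unlabeled = ["what an IDIOT move", "nice photo"]
train = build_uptraining_set(gold, unlabeled, toy_teacher)
```

The student model never sees the teacher's parameters, only its predictions, which is why any errors the teacher makes propagate into the synthetic portion of the training set.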
All logistic regression models are trained using TFIDF of stemmed unigrams as features. Given the relative imbalance of labels in our training data, in which positive examples of incivility represent only 10.3% of annotated posts, we use ADASYN (He et al., 2008) to generate additional synthetic data for oversampling. We train a second model on synthetic data consisting of 5 million Reddit posts which are labeled for incivility using our trained DistilBERT model. Results are shown in Table 2.
To test the overlap of concepts such as hate speech and offensive language with incivility, we applied the classifier provided by Davidson et al. (2017) to the test portion of our Reddit dataset. To conduct our test, we combined the classes "offensive language" and "hate speech" predicted by the Davidson et al. (2017) classifier into a single class. On our Reddit data, this classifier achieves an F1 of 0.242, indicating limited overlap between these domains. This test demonstrates that the definitional, conceptual, and operational differences between incivility and related domains of offensive speech are indeed represented in our labeled data.
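The class-merging comparison above amounts to mapping a three-way prediction onto a binary label and scoring it with F1 against the incivility annotations. The sketch below shows that computation on invented toy data; the label strings and the helper names are assumptions, and the result does not reproduce the reported 0.242.

```python
def to_binary(label):
    """Collapse 'hate speech' and 'offensive language' into one positive class."""
    return int(label in {"hate speech", "offensive language"})

def f1_score(gold, pred):
    """F1 for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical three-way predictions and gold incivility labels.
three_way_pred = [
    "neither", "offensive language", "hate speech", "neither", "offensive language",
]
gold = [0, 1, 0, 1, 1]
pred = [to_binary(label) for label in three_way_pred]
score = f1_score(gold, pred)
```

A low F1 under this mapping can arise from either direction: uncivil comments that contain no hateful or offensive vocabulary, or offensive vocabulary used in comments that annotators judged civil.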
In order to further test the efficacy of our implementation, we train a logistic regression model as outlined above using Twitter data collected and annotated by Theocharis et al. (2020). Finally, to create our platform agnostic model, we train a logistic regression model by combining our annotated and synthetic Reddit data with the annotated and synthetic data from Theocharis et al. (2020), which we test on the Theocharis et al. (2020) Twitter test set, as shown in Table 3.

Analysis and Discussion
Our encouraging results in classifying incivility in Reddit posts demonstrate the efficacy of our dataset. Due to the scale of the data to be ultimately classified, we were concerned as much with computational efficiency as with prediction accuracy when building our incivility classifier. When we use our trained BERT model to generate a large quantity of synthetically labeled training data, the performance of our logistic regression model is comparable to that of our fine-tuned BERT models.
Concern with computational efficiency also informed our choice of features in our logistic regression model. While alternate features could be used, such as Doc2Vec or Word2Vec embeddings, we chose to use TFIDF due to the simplicity of calculating these features. Additionally, the choice of TFIDF was informed by the work of Theocharis et al. (2020), who demonstrate the utility of TFIDF for the task of incivility classification. Finally, the fact that our TFIDF-based logistic regression model performs similarly well to the BERT model is evidence of the effectiveness of the choice of TFIDF features. That said, the use of alternate features may improve model performance, and we leave this to future work.
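The simplicity of TFIDF features can be seen in a few lines of code. The sketch below uses the textbook weighting (relative term frequency times log of inverse document frequency) on whitespace-tokenized unigrams. It is an illustration under stated assumptions: it omits the stemming step used in our models, and library implementations typically apply additional smoothing and normalization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

# Hypothetical toy corpus of three short comments.
docs = ["this is dope", "this is rude", "this is fine"]
vecs = tfidf(docs)
```

Terms shared by every document receive zero weight, so each toy comment is represented entirely by its one distinctive word, with no information about the surrounding context.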
The similarity between the predictions made by our BERT model and our logistic regression model indicates that the logistic regression model retains much of the predictive power of the BERT model. In fact, across 996 test comments, the two models disagreed on only 27 comments, for a rate of 2.7%. From reviewing the disagreements we can identify several classes of comment on which the two models often disagree. The first, and most obvious, is very long comments. BERT is designed to truncate long input text (our implementation truncates inputs longer than 256 tokenized word pieces). Thus, our BERT model may mislabel longer comments in which the incivility occurs later in the comment. Another source of disagreement comes from the fact that our TFIDF-based classifier tends to be more sensitive to individual lexical items, which is to be expected as BERT is known to condense far more semantic information than do count-based vectorization techniques such as TFIDF (Jawahar et al., 2019). For example, our regression model mislabels the comment "This is dope! Does anyone know where I can purchase one for myself?" as an uncivil comment, presumably due to the presence of the word "dope", while our BERT model labels the comment correctly. In future work, we plan to conduct a more rigorous analysis of labelling disagreements between the two models to better understand the role of lexicon and compositional semantics in the incivility classification task.
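Two points from this comparison can be checked with a short sketch: the disagreement rate arithmetic and the effect of truncation on long comments. The prediction lists below are synthetic placeholders constructed only to match the reported counts (27 disagreements among 996 comments), and the token list is an invented example.

```python
def disagreement_rate(preds_a, preds_b):
    """Fraction of items on which two models' predictions differ."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def truncate(tokens, max_len=256):
    """Keep only the first `max_len` tokens, as a length-limited model would."""
    return tokens[:max_len]

# Placeholder predictions matching the reported counts, not the real outputs.
bert_preds = [0] * 996
logreg_preds = [0] * 969 + [1] * 27
rate = disagreement_rate(bert_preds, logreg_preds)

# A long comment whose only uncivil token appears after position 256.
long_comment = ["fine"] * 300 + ["idiot"]
visible = truncate(long_comment)
```

Here `rate` equals 27/996, which rounds to 2.7%, and the uncivil token never reaches the truncated input, which is the failure mode described above for long comments.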
Finally, we demonstrate the flexibility of our model training strategy by creating a combined incivility prediction model using our automatically labeled Reddit data with the synthetic data provided by Theocharis et al. (2020). The resulting model has shown promise as a platform agnostic incivility classifier model for social media.

Conclusion and Future Work
In this paper, we present a new dataset of Reddit posts annotated at the comment level for incivility, as well as at the subreddit level for political content. Further, we demonstrate the efficacy of this dataset to train machine learning models for incivility detection, both alone and in combination with previously available datasets, to create a platform agnostic classifier for incivility on social media.
Using our trained classifier, our future goal is to provide a systematic overview of trends in incivility on social media, across time and a variety of discussion topics. The project aims to capture the fluctuations in the prevalence of incivility in political and non-political online spaces, politically homogeneous and heterogeneous discussions, liberal and conservative ones, and also among different non-political topics. The anticipated study will add to our understanding of the development of online incivility and shed light on incivility interventions.