Incivility Detection in Online Comments

Incivility in public discourse has been a major concern in recent times, as it can negatively affect the quality and tenacity of the discourse. In this paper, we present neural models that can learn to detect name-calling and vulgarity from a newspaper comment section. We show that, in contrast to prior work on detecting toxic language, fine-grained incivilities like name-calling cannot be accurately detected by simple models like logistic regression. We apply the models trained on the newspaper comments data to detect uncivil comments in a Russian troll dataset, and find that despite the change of domain, the model makes accurate predictions.


Introduction
Online harassment, colloquially known as cyberbullying or cyber harassment, has been rampant since the introduction of the Internet to the general population. It has been a major cause of concern since the mid- and late 1990s, and is a thoroughly researched topic in the fields of social science, behavioral science, network science, and computer security. Cyberbullying is a form of harassment that is carried out using electronic modes of communication such as computers, phones, and, in almost all cases in recent years, the Internet. Patchin and Hinduja (2006) defined cyberbullying as "willful and repeated harm inflicted through the medium of electronic text", but this phenomenon goes far beyond the scope of electronic text alone. A more comprehensive definition can be found in one of their later works, where they defined cyberbullying as "a form of harassment using electronic mode of communication" (Hinduja and Patchin, 2008). Fauman (2008) described cyberbullying as "bullying through the use of technology such as the Internet and cellular phones".
The spectrum of online harassment is vast; hence, we focus on one segment of this phenomenon: online incivility. Incivility has been rampant in American society for quite some time. Incivility is described as features of discussion that convey an unnecessarily disrespectful tone toward the discussion forum, its participants, or its topics (Coe et al., 2014). While it is often said that incivility is "very much in the eye of the beholder" and what is civil to someone may be uncivil to another, some forms are nevertheless universal. One study has suggested that 69% of Americans believe that incivility in public discourse has become a rampant problem, and only 6% do not identify it as a problem (Shandwick, 2018). The average number of incivility encounters per week has also risen drastically in both the physical world and cyberspace. Social media encounters are especially alarming: a person who encountered any form of incivility anywhere had on average 5.4 uncivil encounters per week on online social media platforms in 2018, which is almost double the number from late 2016.
In this paper, we present machine learning models that can identify two prominent forms of incivility, name-calling and vulgarity, based on user-generated content from public discourse platforms. We trained recurrent neural network models on an annotated newspaper comment section and showed that our model outperforms several baselines, including a state-of-the-art model based on pre-trained contextual embeddings. We applied our newspaper-comments-trained model to a dataset of Russian troll tweets to observe how the model generalizes from one platform to another.
One line of prior work divided incivility into several different forms, including name-calling, vulgarity, lying accusation, pejorative, and aspersion. Its authors took comments posted by regulars on a newspaper website and annotated these for the various forms of incivility. Their research focused mostly on the demographics and other individual attributes of readers of these comments and how they perceived incivility in these comments. Other work focused more on the perpetrators of incivility than on the readers. Its authors studied a handful of news articles published on the Arizona Daily Star newspaper website and the comments posted about these articles, then manually annotated these comments and their posters for incivility and political orientation. They found that conservatives were significantly less likely to be uncivil in these public discussions compared to liberals, and that the likelihood of liberals being uncivil increased with the presence of conservatives in the same discussion. Liberals were also found to be more repercussive compared to the conservatives.

Related Works
Recent work has focused on particular forms of incivility, as described in the following sections. Reynolds et al. (2011) developed machine learning models that can detect cyberbullying by identifying curse and insult words in social media posts. They collected a small set of posts from a website named formspring.me and used various non-sequential learning algorithms on this dataset to build a binary classifier for cyberbullying detection. Another study used a vulgarity score for better sentiment prediction from a collection of 6800 tweets. Its authors found that vulgarity interacts with key demographic variables like gender, age, and religiosity. Other research has also identified demographic keys closely associated with vulgarity: Wang et al. (2014) presented a quantitative analysis of the frequency of curse word usage on Twitter and its variation with certain demographics, and Gauthier et al. (2015) analyzed the usage of swear words based on Twitter users' age and gender. As none of these papers presents a machine learning model that can be used for vulgarity detection, the authors of one later study claim their work to be the first in vulgarity prediction. They classified the functionality of vulgarity into five different cohorts (aggression, emotion expression, emphasis, auxiliary, and signalling group identity) and used binary logistic regression classifiers to identify vulgar texts. They also showed the correlation between demographic variables and vulgarity, and found that age, faith, and political ideology have a significant correlation with vulgarity usage.

Racism/sexism
Waseem and Hovy (2016) presented machine learning models that can be used to detect racism and sexism in social media. They collected and annotated a set of almost 17000 tweets, and used them to build character-based n-gram models for offensive tweet detection. They provided an extensive list of criteria that identify a tweet as racially or sexually offensive, and showed that demographic information adds little performance to a character-level model. Wulczyn et al. (2017) introduced a methodology to generate annotations for personal attacks. They used crowdsourcing to annotate a set of Wikipedia comments, and used a machine learning model to imitate this annotation on a much larger scale. Agrawal and Awekar (2018) developed deep neural models that can detect cyberbullying (Reynolds et al., 2011), racism/sexism (Waseem and Hovy, 2016), and personal attacks (Wulczyn et al., 2017) on multiple social media platforms. They claim that theirs is the first work to systematically analyze cyberbullying in social media towards building deep prediction models. They showed that hand-crafted features based on lexicons are not a good choice, as abusive word vocabularies vary considerably from one social media platform to another, and swear words are not always considered uncivil in social media.

Personal attacks

Name-calling
Habernal et al. (2018) analyzed ad hominem attacks in Change My View, a "good faith" argumentation platform that is hosted on Reddit. They identified posts that Reddit moderators had marked as violating the forum's rules against ad hominem attacks. To identify such posts, they used stacked bidirectional Long Short-Term Memory networks (LSTMs) and Convolutional Neural Networks (CNNs), and achieved 78% and 81% accuracy, respectively. One of their most interesting findings was that in 48.6% of the cases, ad hominem attacks are in the last comment of the thread, which shows that personal attacks and name-calling can affect user participation in public discourse.
Works that closely resemble what we are trying to do share one major issue with the datasets that have been used: they are often annotated by Mechanical Turk workers (Wulczyn et al., 2017; Reynolds et al., 2011). Incivility is based on the perception of the person on the receiving end, and this perception varies widely from person to person. Using crowd workers that we know almost nothing about is not ideal, as differences in perception may introduce unintended bias in the dataset. Hence, we need a dataset that is annotated by experts who have extensive knowledge of incivility detection. Coe et al. (2014) present one such dataset, and we use it for our incivility detection task (more on this in Section 4).

Incivility Classification and Definitions
For our work, we use the incivility classification presented by Coe et al. (2014): name-calling, vulgarity, aspersion, lying accusations, and pejorative for speech. We focus on the two most prevalent forms in the Coe et al. (2014) data: name-calling and vulgarity.
Name-calling: Ad hominem attacks. Although ad hominem attacks are often used to derail a conversation by directing derogatory terms at another person, the authors included every instance of a derogatory remark, irrespective of target and intention. For example, "At least the morons in the state capital no longer have control of this process!" is identified as a name-calling comment because it contains the word "moron".
Vulgarity: Content that includes any sort of curse word, including minor ones such as "damn". For example, "I hope the voters will kick that politician out on his pompous ass next election." is marked as vulgar because it contains the word "ass".
Coe et al. (2014) graciously shared with us the data that they collected from the comment section of the Arizona Daily Star newspaper. They collected articles and comments between 17 October 2011 and 6 November 2011 from eight news sections: Business, Entertainment, Lifestyles, Local News, Nation and World, Opinion, Sports, and State News. All their data was downloaded and saved manually by one research assistant one day after the articles were posted, to provide enough time for the article to garner comments, yet not long enough for the article to be deleted. At the end of the data collection period, a total of 706 articles and 6535 comments had been collected, of which they coded 6444 for further analysis. They used three teams of 3-5 research assistants to code articles and comments for incivility. The teams had extensive training on the coding procedures (Coe et al., 2014). The coding process took approximately six weeks, and chance-corrected intercoder reliability was established prior to the coding, ranging from 0.61 to 1.0 Krippendorff's alpha for different codes. In addition to coding the incivilities present in the comments, they also coded a variety of other metadata, e.g., the author's name, reactions received from other readers (thumbs up or thumbs down), and word counts. All the results of the coding procedure were saved in a metadata file created using Microsoft Excel. Comments were saved in separate PDF files named according to news section, article, and date.

Challenges in Identifying incivilities from User Contents
As mentioned before, incivility is in the eye of the beholder, so it is sometimes challenging to identify what can be unequivocally considered an uncivil interaction. Informed by the Coe et al. (2014) data, the following sections discuss some of these challenges.

Frequency
Although researchers have found incivility to be rampant in public discourse (Shandwick, 2018), it is still minuscule compared to regular civil discourse on any social platform. As most of our identification and prediction techniques are data-driven, it is difficult to create a model that can identify incivilities from this small number of examples.

Linguistic Variations and Creativity
Oftentimes people refrain from using an exact version of an uncivil phrase and use an abbreviation or spelling variation of that phrase instead. For example, in "All BS, just like the politicians - the same crap", the term BS is clearly an abbreviation of the word bullshit. However, there are also instances in the data where BS is used to abbreviate a person's name, which clearly is not an example of an uncivil comment. People also often write uncivil words in derivative spellings. For example, people often use sh!t instead of shit, and the two clearly mean the same thing in a public discourse. Hundreds of these variations may exist, making for a challenging identification problem.
Another challenge in identifying incivilities is that people can be really creative when they try to attack someone. This often happens when someone tries to indulge in ad hominem attacks with plausible deniability. For example, we have observed people using the word DemocRat instead of Democrat to identify someone with a democratic political orientation. Although these two words look similar and sound exactly the same, DemocRat indicates that the target democrat is also a rat, a colloquial word for a spy or a dishonest person. There are many other examples of this kind of variation, e.g., democraps. This phenomenon is sometimes referred to as obscenity obfuscation, and researchers have found that it is becoming increasingly common in user-generated content on all sorts of social media platforms (Rojas-Galeano, 2017).

Difficulty in Comprehension
It is sometimes difficult to understand whether a word or a phrase is used in an uncivil manner without understanding the context. For example, the word lazy can be used to describe the state of something that is actually slow or ineffective (e.g., lazy algorithms), or it can be used as an ad hominem attack on someone (e.g., the lazy politicians have ruined this country). As understanding the context of a piece of content in a public discourse is difficult, separating these cases based on their contexts is challenging.

Incivility Prediction
In this section, we focus on our attempt to create a machine learning model that can be used as an incivility filter for moderators on social media platforms. Our model exclusively uses features obtained from the contents and reciprocations in the platform, while avoiding the demographic information that was used heavily by prior work. This allows our models to be used on the large portion of online discourse where such demographic information is unavailable, e.g., where users are anonymous.

Data preparation
We train our incivility prediction models on the Coe et al. (2014) data discussed in Section 4. However, that data was designed for use in social science research, not natural language processing research, and thus there were several challenges in working with the data as it was collected, including:
• The comments were saved in PDFs, and the metadata referenced each comment by a number that was drawn (not typed) into the PDF beside the comment.
• The naming conventions for the files were inconsistent (spelling variations, variable length identifiers, etc.).
• Dates were saved using multiple formats (ddmmyy, dd-mm-yy, etc.).
• There were no specific markers in the text that identified the start and end of a comment.
• Many comments contained quotations from other comments, also with no consistent markers of where quotes began or ended.
We solved these problems using a combination of regular expressions (e.g., for normalizing dates), brute-force techniques (e.g., quotations were identified by comparing against all previous comments), and manual revision (e.g., renaming the files whose names were too inconsistent to be resolved automatically). The resulting set of annotated comments was saved in JSON format for further computational analysis. After the extraction and cleaning process, we ended up with 6175 of the original 6444 comments.
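The two automated steps mentioned above can be illustrated with a minimal sketch. It is not the authors' cleaning script: the function names, the assumption that two-digit years fall in the 2000s, and the five-word minimum for a quoted span are all hypothetical choices made for the illustration, and only the two date formats named in the text are handled.

```python
import re
from datetime import date

# Assumed formats, taken from the examples in the text: ddmmyy and dd-mm-yy.
_DATE_RE = re.compile(r"^(\d{2})-?(\d{2})-?(\d{2})$")


def normalize_date(raw: str) -> str:
    """Return an ISO date string (YYYY-MM-DD) for a ddmmyy or dd-mm-yy input."""
    match = _DATE_RE.match(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized date format: {raw!r}")
    day, month, year = (int(part) for part in match.groups())
    # Assumption: two-digit years belong to the 2000s (the data are from 2011).
    return date(2000 + year, month, day).isoformat()


def find_quoted_comments(comment, previous_comments, min_words=5):
    """Brute force: return earlier comments that reappear verbatim in this one.

    min_words is a hypothetical threshold that keeps very short comments
    from being reported as quotations.
    """
    return [
        earlier
        for earlier in previous_comments
        if len(earlier.split()) >= min_words and earlier in comment
    ]
```

For instance, `normalize_date("171011")` and `normalize_date("17-10-11")` both yield `2011-10-17`. Exact substring matching would miss quotations that were edited or truncated, which is one reason manual revision would still be needed.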

Prediction Task
Our main focus was to build a prediction model that can work as a filter for incivility in public discourse. We were also interested in how a model trained on public discourse data would work on a social media platform. We first divided our dataset into three smaller sets: training, validation (development), and test. Comments were randomly assigned to sets, and we ended up with 3950 comments in the training set, 989 comments in the validation set, and 1236 comments in the test set. We set the test set aside for our final evaluation, and worked only on the training and validation sets to find the model that best fit the problem.
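A random split with these set sizes can be sketched as follows. The function name and the fixed seed are assumptions made for reproducibility of the illustration; the text only states that comments were assigned randomly.

```python
import random


def split_comments(comments, n_train=3950, n_dev=989, seed=0):
    """Shuffle comments and cut them into training / validation / test lists."""
    shuffled = list(comments)
    random.Random(seed).shuffle(shuffled)  # seed is an assumption, not from the paper
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # everything left over
    return train, dev, test


# Stand-in for the 6175 cleaned comments.
train, dev, test = split_comments(range(6175))
print(len(train), len(dev), len(test))  # 3950 989 1236
```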

Baselines
We found a similar task on Kaggle (Wulczyn et al., 2017) that tries to identify the toxicity of comments in the discourse section of Wikipedia. In that task, the best performing model was a recurrent neural network model with gated recurrent units (GRUs; Cho et al., 2014), but some simple non-sequential models (logistic regression and support vector machines) also performed almost as well as the sequential model.
For our baselines, we used two non-sequential machine learning techniques: logistic regression and support vector machines, using TF-IDF vectors obtained over words in the comments. We also considered a state-of-the-art out-of-the-box text classification model as a baseline, the Flair text classification model (Akbik et al., 2018), which uses GloVe word embeddings (Pennington et al., 2014) and pre-trained contextual word embeddings derived from two character-level language models. Flair achieved state-of-the-art performance in part-of-speech tagging and named-entity recognition tasks, and we thought that the character-based nature of the Flair model might be helpful in the face of the linguistic variation and creativity challenges we discussed earlier.
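To make the baseline features concrete, the sketch below computes word-level TF-IDF weights with the textbook formula tf × log(N / df). The paper does not state which TF-IDF variant, tokenizer, or library was used, so the whitespace tokenization, lowercasing, and unsmoothed IDF here are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    # Assumption: lowercase whitespace tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(tokenized)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors
```

A word that occurs in every comment receives weight 0, while rarer words are up-weighted. Vectors of this kind would then be fed to the logistic regression or support vector machine classifier.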

Model
Our model was inspired by the top performing systems in the Kaggle competition, and started with FastText embeddings (Joulin et al., 2016) for each of the words in a comment. These word vectors were fed to a recurrent layer consisting of bidirectional GRUs. The outputs of the GRUs were fed to an average pooling layer and a max pooling layer, whose outputs were then concatenated. The output of the pooling was then fed through a sigmoid layer to produce the outputs. To avoid overfitting, we used a dropout layer (Srivastava et al., 2014) with probability 0.2 between the input and hidden layers. We set the maximum input length to 500 words for each comment, as this yielded the best validation performance in our preliminary analysis. We set class weights based on the frequency of name-calling and vulgarity: non-name-calling comments are 7 times more common than name-calling ones, and non-vulgar comments are 35 times more common than vulgar ones, so we used a weighting scheme of 1:7 for name-calling and 1:35 for vulgarity. The model was trained with the Adam optimizer (Kingma and Ba, 2015) on mini-batches of size 32, with other hyperparameters set to their defaults. We trained each instance of this model for at most 500 epochs, with early stopping if the validation accuracy did not improve for 10 consecutive epochs. The general structure of this model is shown in Figure 1.
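The pooling and output stages, together with the class weighting, can be illustrated in plain Python. This is a sketch of the arithmetic only, not the authors' implementation: the GRU outputs are taken as given, the function names are hypothetical, and the output layer is simplified to a single sigmoid unit. The loss shown is one common way to realize a 1:7 or 1:35 weighting, assumed here for illustration.

```python
import math


def pooled_probability(hidden_states, weights, bias=0.0):
    """Average-pool and max-pool over time, concatenate, apply a sigmoid unit.

    hidden_states: list of per-word GRU output vectors (lists of floats).
    weights: one weight per concatenated feature (2 * hidden size).
    """
    n_steps = len(hidden_states)
    dims = range(len(hidden_states[0]))
    avg_pool = [sum(step[d] for step in hidden_states) / n_steps for d in dims]
    max_pool = [max(step[d] for step in hidden_states) for d in dims]
    features = avg_pool + max_pool  # concatenation doubles the feature size
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def weighted_bce(prob, label, positive_weight):
    """Binary cross-entropy with the rare positive class up-weighted."""
    if label == 1:
        return -positive_weight * math.log(prob)
    return -math.log(1.0 - prob)
```

With `positive_weight=7` for name-calling or `35` for vulgarity, a missed uncivil comment costs as much as 7 or 35 misclassified civil ones, which offsets the class imbalance described above.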
To further improve our model, we wanted to incorporate any metadata that were available to use. Coe et al. (2014) found that the thumbs-up and thumbs-down reactions received by a comment, the section of the article, and the author of the article all had some significance regarding incivility in the forum, so we introduced these metadata as features in our model. We created normalized feature vectors built on these attributes, and introduced them as auxiliary features right before the sigmoid layer by concatenating them with the output of the pooling layers.
We also explored external resources that could improve our model. We created a pretrained model on the Kaggle dataset discussed earlier, as it had a large amount of annotated comments (over 160 thousand comments obtained from Wikipedia contributor's community). We used the same RNN model to train on the Kaggle data until it reached convergence, then retrained the model using our Arizona Daily Star data. The only portion of the model that was not shared between the pre-training (on Kaggle) and the training (on Arizona Daily Star) was the output sigmoid layers.

Experimental Results
The performance of the different models can be seen in instances of vulgarity in the development dataset, hence, Flair automatically outperformed these two. But our GRU-based model easily outperformed the Flair model (51.13 vs. 36.55 F1 in name-calling, and 48.00 vs. 11.43 F1 in vulgarity). These results stand in contrast to the Kaggle competition on toxicity detection, where such baselines performed nearly as well as the best (GRU-based) model, and all models achieved high levels of performance (>0.98 area under the receiver operating characteristic curve). This suggests that the finer-grained incivility detection formulated by Coe et al. (2014) is more challenging than simple toxicity detection.
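For reference, the F1 scores quoted here are the harmonic mean of precision and recall on the positive (uncivil) class. The helper below is a standard definition written out for illustration, with a hypothetical function name; it returns fractions, whereas the scores above are reported as percentages.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Under this definition, a classifier that never predicts the positive class scores an F1 of 0, regardless of how accurate it is on the majority class.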
Adding the auxiliary features (upvotes, etc.) to the GRU-based model had virtually zero effect, with slight improvement on the model's precision but a slight drop in recall for name-calling, and absolutely no change for vulgarity. Using the Kaggle dataset to pre-train our GRU-based model before training on the Arizona Daily Star data yielded very high precisions, but at the cost of very low recalls. This suggests that while there is some overlap between the two tasks (toxicity detection and incivility detection), the differences between the tasks make it difficult to directly leverage the data from one task in the other.

Incivility Prediction in Twitter
Though we built our models to detect incivilities in newspaper comments, we were interested in how well they would perform in other domains of social media. Karan and Šnajder (2018) showed that cross-domain adaptation for detecting abusive language is possible; hence, we wanted to observe how well our model performs on a set of tweets.
In June 2018, the United States House Intelligence Committee released a list of 3841 Twitter account names that were human-operated troll accounts associated with Russia's Internet Research Agency. Darren Linvill and Patrick Warren from Clemson University collected all the tweets published since June 2015 from these accounts, cleaned them, and published a set of almost 3 million tweets (Linvill and Warren, 2018). These tweets are publicly available on FiveThirtyEight's GitHub page (https://github.com/fivethirtyeight/russian-troll-tweets).
As prior research suggests that trolls are a big source of incivility on social media platforms (Fauman, 2008; Hinduja and Patchin, 2008), we took this opportunity to observe how our model performs on this dataset. We downloaded all the tweet texts and ran our GRU-based model on these texts. Results of this experiment can be found in the author's GitHub repository.

Observations
Our model identified 13% of all tweets as name-calling and 1.7% as vulgarity. These rates are roughly similar to those in the Arizona Daily Star training data, which had 14% name-calling and 2.8% vulgarity. Though we do not have access to the expert annotators used by Coe et al. (2014), we can nonetheless get an approximate measure of our model's performance by sampling predictions from our model and estimating the true label following the Coe et al. (2014) annotation guidelines.
To measure our model's precision, we took the 250 tweets that our model was most certain contained name-calling, and the 250 tweets that our model was most certain contained vulgarity. We manually reviewed each of these 500 tweets, and found only 7 instances of mistakenly tagged name-calling and 5 instances of mistakenly tagged vulgarity. To get a rough sense of our model's recall, we looked at the other end of the model's prediction spectrum. Based on a manual review of the model's predictions, the model almost never makes a mistake when the prediction score is below 10%; we found only one instance of mistaken name-calling, and no instance of mistaken vulgarity, in the bottom 250 tweets that we manually annotated. Table 2 shows some example tweets and the prediction scores from our model. The bottom two examples under name-calling and the bottom example under vulgarity represent mistakes. In the first name-calling error, the model is confident (probability 0.979) that there is name-calling, perhaps because the terms GOP and POTUS frequently appear with name-calling in our training data. In the second name-calling error, the model is confident (probability 0.989) that there is name-calling, likely because of the presence of the word pathetic, which here is an aspersion (attacking an idea) rather than name-calling (attacking a person). In the vulgarity error, hell has been used to reference the religious concept of hell, but the word is strongly associated with vulgarity in the training data. The table also shows some examples of reasonable successes of the model, for example, handling vulgar abbreviations like BS (short for bullshit) and WTH (short for what the hell).
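The manual check described above amounts to ranking tweets by predicted probability, reviewing the top 250 by hand, and dividing the number of correct tags by the number reviewed. A small sketch follows; the function names and the (score, text) pair layout are assumptions made for illustration.

```python
def most_confident(scored_tweets, k=250):
    """Return the k (score, text) pairs with the highest predicted probability."""
    return sorted(scored_tweets, key=lambda pair: pair[0], reverse=True)[:k]


def precision_from_review(n_reviewed, n_mistakes):
    """Fraction of manually reviewed positive predictions that were correct."""
    return (n_reviewed - n_mistakes) / n_reviewed


print(precision_from_review(250, 7))  # name-calling: 0.972
print(precision_from_review(250, 5))  # vulgarity: 0.98
```

These figures estimate precision only among the most confident predictions, so they are an upper-end estimate rather than precision over all tweets the model flags.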

Future Works and Conclusion
Our work here aims at keeping a civil environment in public discourse forums and social media platforms. Our goal was to build a filtering system that could work alongside human moderators to reduce their workload, be objective and independent of user reporting, and perform well on previously unseen social media streams. There is much work to do in this area: annotation of a large random sample of the troll tweets can give a more thorough estimate of model performance, and various forms of domain adaptation, like self-training, might be applied to improve the performance of the model. We used word n-grams as features in our baseline models, which could be improved by using features obtained from domain-specific lexicons. There are lexicons of abusive words (Wiegand et al., 2018) that can be used to create non-sequential models with smaller feature sets. Whether these simpler models are better is yet to be proven, as Agrawal and Awekar (2018) showed that the vocabulary of words used for cyberbullying varies significantly from one social media platform to another. They also showed that swear words are not necessarily uncivil in online social media; hence, these types of detection techniques should not rely on such hand-crafted features.
One research question that follows from this work is whether incivility affects user engagement in social media. Prior research has observed that receiving replies can affect a user's engagement (Sadeque et al., 2015), and the language of these replies can also have consequences (Arguello et al., 2006). Habernal et al. (2018) showed that 48% of comments that included ad hominem attacks ended the argument, which is indicative of lower engagement by the entire community. Hence, we believe that incivility has a significant influence on user engagement, and in turn may contribute to a community's sustainability. This is yet to be proven, and more work needs to be performed to prove or disprove this hypothesis.
In this paper, we have presented a recurrent neural network model that can identify incivilities in public discourse. Though it was trained on a corpus of newspaper comments, we have initial evidence that it also performs well in detecting incivilities on Twitter. We believe our model will be able to serve as a wide-range incivility filter on other social media platforms.