Preemptive Toxic Language Detection in Wikipedia Comments Using Thread-Level Context

We address the task of automatically detecting toxic content in user-generated texts. We focus on exploring the potential for preemptive moderation, i.e., predicting whether a particular conversation thread will, in the future, incite a toxic comment. Moreover, we perform a preliminary investigation of whether a model that jointly considers all comments in a conversation thread outperforms a model that considers only individual comments. Using an existing dataset of conversations among Wikipedia contributors as a starting point, we compile a new large-scale dataset for this task consisting of labeled comments and comments from their conversation threads.


Introduction
Due to the ever-growing amount of user-generated content online, manual moderation of such content is becoming difficult to scale up. At the same time, the relative anonymity and lack of personal contact between participants of web conversations lowers inhibitions and increases the risk of toxic behavior, making adequate moderation all the more important. Consequently, automated detection of toxic language in user-generated content is an increasingly important area of research. While automated classification models are unlikely to ever fully replace human moderators, they can make their task easier by suggesting which content to prioritize for moderation.
The typical way to approach this problem is via supervised machine learning, where an input to a model is a user-generated text, and the output is a classification decision (toxic or non-toxic) or a numerical toxicity score. In this paper, we explore two possible extensions of this approach: preemptive classification and thread-level models.
While practically very useful, standard models are only applicable in a post-hoc scenario, i.e., to detect a toxic comment after it has already been posted. An alternative approach would be to have models detect situations that are likely to lead to toxic comments. If successful, such models would enable moderators to preemptively focus on potentially problematic discussion threads and then either intervene and guide the discussion away from conflict or respond in near real-time after the toxic comment is posted. Large-scale implementation of such near real-time moderation might be unnecessary and require too many moderators. However, for limited parts of discussions that are known to pertain to especially vulnerable social groups, this might be a feasible approach. Our first research question is whether such preemptive toxic comment detection is viable.
The second research question pertains to the benefits of including thread-level information when detecting toxic comments. Namely, most existing models consider every comment in isolation, therefore ignoring the context provided by the other comments in the discussion. For post-hoc models, while useful, this additional information may not be crucial, as the main indicators of toxicity are most present in the text of the comment being classified rather than in the rest of the thread. In the preemptive scenario, however, the model has access only to comments that appeared before a toxic one. We hypothesize that considering the entire thread of comments might be of greater importance in this case.
The contribution of this paper is threefold. First, using a large dataset of conversations among Wikipedia contributors, we compile and make publicly available a new dataset with complete discussion threads and with semi-automatically generated toxicity labels. Second, we explore the viability of models for the preemptive toxic comment detection task. Third, we investigate the potential benefits of including thread-level information in the models.
All the above-mentioned approaches deal with the post-hoc scenario. Other work we are aware of that explores the preemptive scenario is that of Zhang et al. (2018a). There, the task is to predict, given an initial courteous exchange of two user comments, whether the third comment will be toxic. The authors create a manually labeled dataset and perform an extensive study on which pragmatic and rhetorical devices are indicative of conversation toxicity. Moreover, there is the work of Liu et al. (2018), where a logistic regression classifier with a rich feature set (including thread-level features) is evaluated on a dataset of 30,000 manually labeled Instagram comments. In contrast, the dataset produced in our work is much bigger, but has only silver labels.
In our work we consider the use of thread-level information for toxic comment detection. Within the scope of this work we limit ourselves to simple mechanisms for including this information into deep learning models. Recently, deep learning models have been proposed that leverage graph structures, such as TreeLSTM (Tai et al., 2015) and GraphSAGE (Hamilton et al., 2017), which might be useful for modeling thread-level structure in our task. We leave the investigation of this possibility for future work.

Dataset
At the time of writing we were not aware of the dataset of Liu et al. (2018). At first we considered using the dataset of Zhang et al. (2018a), but found it rather small (∼1200 examples) for deep learning approaches. Furthermore, this dataset was constructed using a very carefully designed methodology for a specific experiment: detecting whether a toxic comment will appear given a courteous initial exchange of two comments. We are interested in a more general case, where conversation threads might be longer and do not necessarily start in a courteous manner. Moreover, we aimed at a setting which would better reflect the realistic working conditions in which our models would be used and allow us to measure their practical impact. Consequently, we decided to create a new dataset from the data collected by Hua et al. (2018). It contains the entire conversational history of comments on Wikipedia modeled as a graph of actions. The possible actions are Creation, Addition, Modification, Deletion, and Restoration. Automatically derived toxicity scores are also provided for each example.
We apply the following steps to this dataset:
Step 1. Filter the data to remove all threads with fewer than 2 different participants. This leaves ∼8.7M threads.
Step 2. Apply all Modification actions, to update the comments to their most recent version.
Step 3. Flag comments that were deleted. A comment is considered deleted if there is a Deletion action on it, without a subsequent Restoration action that would undo the effect.
Step 4. Split the threads into the train (70%), dev (15%), and test (15%) sets. The split is done across time: the test set contains the most recent threads, while the train set contains the least recent.
Step 5. Semi-automatically label the examples for toxicity. An example is considered toxic if its toxicity or severe toxicity score is above 0.64 or 0.92, respectively, and it was deleted by a person who is not the comment author. This heuristic takes into account two signals: (1) the fact that a toxicity classifier has high confidence for a comment and (2) the fact that the comment was deleted. Considering only deleted examples as toxic would yield noisy labels, as comments are often deleted for reasons other than being toxic. Manual inspection of the silver-labeled dataset reveals that the combination of the toxicity classifier and observed deletion is effective in identifying some of the toxic comments. However, this approach fails to identify those toxic comments which were not deleted or those for which the toxicity classifier failed. The former issue is not problematic, as it was shown by Hua et al. (2018) that toxic comments on Wikipedia get deleted by the community very quickly. Thus toxic comments that are not deleted are quite rare. The latter issue, however, represents a limitation of our work. Our results apply only to those types of toxic language that are detectable by current post-hoc models. Extending this dataset to account for more complex types of toxic language would require considerable annotation effort and we leave it as a possibility for future work.
Step 6. Generate examples from each thread in the train/dev/test set for the (1) preemptive scenario and (2) post-hoc scenario as shown in Figure 1. The number of preceding comments available for each comment in this data can vary from 0 to over 100 (median is 2). Consequently, to differentiate from the setting in the next step, we will refer to this setting as the real-context setting.
Step 7. While the previous setting is more realistic, we also wished to evaluate the thread-level models in a setting where the context provided by preceding comments is always available, in order to better assess their full potential. To this end, we filter the examples from the previous step such that only those are left that have at least $L_{min} = 2$ comments on the path to the root. We will refer to the datasets obtained in this step as being in the rich-context setting.
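The labeling rule of Step 5 and the time-based split of Step 4 can be sketched in a few lines. This is a minimal illustration, not the released implementation: the `Comment` record and its field names are hypothetical, the two score conditions are read as alternatives (either one suffices), and placing the dev set chronologically between train and test is an assumption, since the text only states that train is the oldest and test the most recent.

```python
from dataclasses import dataclass
from typing import Optional

TOXICITY_THRESHOLD = 0.64         # threshold from Step 5
SEVERE_TOXICITY_THRESHOLD = 0.92  # threshold from Step 5


@dataclass
class Comment:
    """Hypothetical record; field names are illustrative only."""
    author: str
    toxicity: float
    severe_toxicity: float
    deleted_by: Optional[str] = None  # None: never deleted, or deleted and restored


def silver_label(comment: Comment) -> bool:
    """Toxic iff a score exceeds its threshold AND someone else deleted the comment."""
    high_score = (comment.toxicity > TOXICITY_THRESHOLD
                  or comment.severe_toxicity > SEVERE_TOXICITY_THRESHOLD)
    deleted_by_other = (comment.deleted_by is not None
                        and comment.deleted_by != comment.author)
    return high_score and deleted_by_other


def chronological_split(threads, train=0.70, dev=0.15):
    """Split (timestamp, thread) pairs by time: oldest to train, newest to test.

    Assumption: the dev set sits between train and test in time.
    """
    ordered = sorted(threads, key=lambda pair: pair[0])
    n_train = round(len(ordered) * train)
    n_dev = round(len(ordered) * dev)
    return (ordered[:n_train],
            ordered[n_train:n_train + n_dev],
            ordered[n_train + n_dev:])
```

Requiring both signals keeps a high-scoring comment that nobody removed, or one that was retracted by its own author, out of the positive class.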
As the label distribution of the examples obtained in this way is extremely skewed (positive examples account for 0.5 to 1 percent of the data, depending on the setting and scenario), we undersample the negative class. For completeness, we also retain the non-undersampled versions of the dev and test sets for some of the experiments.
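As an illustration of the undersampling step, the sketch below keeps every positive example and draws an equally sized random sample of negatives. The function name, the `(features, label)` tuple layout, and the fixed seed are assumptions made for the example; the 1:1 target ratio follows from the balanced test data used in the evaluation.

```python
import random


def undersample_negatives(examples, seed=0):
    """Return a class-balanced subset of (features, label) pairs, label in {0, 1}."""
    rng = random.Random(seed)
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == 0]
    # Keep all positives and an equally sized random sample of negatives.
    kept = rng.sample(negatives, k=min(len(positives), len(negatives)))
    balanced = positives + kept
    rng.shuffle(balanced)
    return balanced
```

With positives at 0.5 to 1 percent of the data, this discards the vast majority of negative examples, which is why scores on the balanced sets should not be read as estimates of performance on the natural distribution.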
Lastly, to additionally evaluate the quality of the silver labels, we manually labeled 100 examples from the balanced version of the dataset. We found that on these examples the silver labels have 0.51 precision and 1.00 recall. This yields an F1 measure of 0.67, which is somewhat lower than the expected 0.85 obtained for this classifier by Hua et al. (2018). The difference indicates that the thresholds from Hua et al. (2018), obtained on non-deleted comments from Wikipedia, may not perform equally well on deleted comments. To address this and increase the quality of the labels, more deleted comments should be manually labeled and thresholds retuned using, e.g., the same error rate method of Wulczyn et al. (2017a).
Some statistics of the final generated dataset are given in Table 1.

Models
We implement several baseline models to get some preliminary results on this data. Our simplest model is a linear support vector machine (SVM) on TF-IDF weighted unigrams and bigrams. We include the most frequent 10k n-grams into the model, and tune the C hyperparameter on the dev data. This model ignores thread context, even when it is available.
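A baseline of this kind could be put together with scikit-learn roughly as follows. This is a sketch under assumptions rather than the code used in the experiments: the choice of scikit-learn, the function name `fit_best_svm`, the grid of C values, and the use of F1 as the dev-set selection criterion are all illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC


def fit_best_svm(train_texts, train_y, dev_texts, dev_y,
                 c_grid=(0.01, 0.1, 1.0, 10.0)):
    """Fit a linear SVM on TF-IDF uni/bigrams and pick C by F1 on the dev set."""
    # max_features keeps the n-grams with the highest corpus frequency.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=10_000)
    x_train = vectorizer.fit_transform(train_texts)
    x_dev = vectorizer.transform(dev_texts)

    best_score, best_c, best_clf = -1.0, None, None
    for c in c_grid:  # hypothetical grid; the paper does not list the values tried
        clf = LinearSVC(C=c).fit(x_train, train_y)
        score = f1_score(dev_y, clf.predict(x_dev))
        if score > best_score:
            best_score, best_c, best_clf = score, c, clf
    return vectorizer, best_clf, best_c
```

The vectorizer is fitted on the training texts only, so the dev and test sets contribute no vocabulary or document-frequency statistics.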
For the deep learning models we use a neural network based encoder to derive a vector representation for each comment in our data. We denote this encoder as $\mathrm{enc}_{com}$. For models that ignore preceding comments, the output of this encoder is fed directly to linear and softmax layers and produces a classification decision for each comment. Thus, the output of our model for a comment $c$, which is a sequence of word embeddings, could be written as:
$$y = \mathrm{softmax}(W \, \mathrm{enc}_{com}(c) + b)$$
where $W$ and $b$ denote the weights and bias of the linear layer.
For models that take preceding comments into account, the input is not a single comment but a sequence of comments, $t = (c_1, \ldots, c_n)$, which includes the comment to classify, $c_n$, and all of its ancestors, $(c_1, \ldots, c_{n-1})$. We first apply $\mathrm{enc}_{com}$ to each individual $c_i$, obtaining comment representations $r_i = \mathrm{enc}_{com}(c_i)$. We then feed the sequence $s = (r_1, \ldots, r_n)$ as features into another encoder, which we denote by $\mathrm{enc}_{thr}$. The output of the model for the given input is similar to before:
$$y = \mathrm{softmax}(W \, \mathrm{enc}_{thr}(s) + b)$$
For implementing the encoders, we performed preliminary experiments with convolutional neural networks (CNN) (Krizhevsky et al., 2012), long short-term memory networks (LSTM) (Hochreiter and Schmidhuber, 1997), and gated recurrent units (GRU) (Cho et al., 2014), tuning hyperparameters on the development data. We found that, on the development data for this task, GRU and LSTM perform similarly and slightly better than CNN. We also found that bidirectional recurrent models perform slightly better than unidirectional ones. To represent the words we use the freely available 50-dimensional GloVe embeddings (Pennington et al., 2014) trained on 6 billion tokens. Preliminary experiments also reveal that models perform better when the embeddings are updated during training. For our final experiments we use two BiLSTM models with a cell/hidden-state size of 50 to implement $\mathrm{enc}_{com}$ and $\mathrm{enc}_{thr}$.
We use Adam (Kingma and Ba, 2015) to train the models with a learning rate of 0.001, minibatch size of 128, and early stopping using the dev set. We denote the variants of the model that are thread-agnostic and thread-aware as BiLSTM and BiLSTM-context, respectively. All models are implemented in PyTorch (Paszke et al., 2017) and the code is available online.
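To make the two-level architecture concrete, the following PyTorch sketch stacks a comment-level BiLSTM under a thread-level BiLSTM. It is an illustration under several assumptions and not the released code: the class names are hypothetical, each encoder is summarized by concatenating its final forward and backward hidden states (the pooling used in the paper is not stated), embeddings are randomly initialized instead of loaded from GloVe, and padding is not masked.

```python
import torch
import torch.nn as nn


class CommentEncoder(nn.Module):
    """enc_com: BiLSTM over the word embeddings of each comment."""

    def __init__(self, vocab_size, emb_dim=50, hidden=50):
        super().__init__()
        # Trainable embeddings; in practice these would be initialized from GloVe.
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, tokens):                   # tokens: (n_comments, n_tokens)
        _, (h, _) = self.lstm(self.emb(tokens))  # h: (2, n_comments, hidden)
        # Assumption: final forward and backward states represent the comment.
        return torch.cat([h[0], h[1]], dim=-1)   # (n_comments, 2 * hidden)


class ThreadClassifier(nn.Module):
    """enc_thr over the comment vectors, then linear and (log-)softmax layers."""

    def __init__(self, vocab_size, hidden=50, n_classes=2):
        super().__init__()
        self.enc_com = CommentEncoder(vocab_size, hidden=hidden)
        self.enc_thr = nn.LSTM(2 * hidden, hidden, batch_first=True,
                               bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, thread):                   # one thread: (n_comments, n_tokens)
        r = self.enc_com(thread)                 # (n_comments, 2 * hidden)
        _, (h, _) = self.enc_thr(r.unsqueeze(0))  # h: (2, 1, hidden)
        features = torch.cat([h[0], h[1]], dim=-1)  # (1, 2 * hidden)
        return torch.log_softmax(self.out(features), dim=-1)


# One illustrative training step with Adam at the learning rate given in the text.
model = ThreadClassifier(vocab_size=100)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
thread = torch.randint(1, 100, (3, 7))           # 3 comments of 7 token ids each
loss = nn.NLLLoss()(model(thread), torch.tensor([1]))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The thread-agnostic variant corresponds to applying the linear layer directly to the output of `CommentEncoder` for the single comment being classified.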

Results
The results are given in Table 4. Each column represents one dataset variant. All differences within the same variant are statistically significant at p<0.05 (tested using bootstrap resampling).
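One common way to run such a test is a paired bootstrap over the test examples, sketched below. The exact procedure used for the reported results is not described, so the function names, the number of resamples, and the one-sided formulation are assumptions.

```python
import random


def f1(gold, pred):
    """F1 of the positive class (label 1); returns 0.0 if there are no true positives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap_p(gold, pred_a, pred_b, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A does NOT beat system B."""
    rng = random.Random(seed)
    n = len(gold)
    not_better = 0
    for _ in range(n_samples):
        # Resample test examples with replacement, using the same indices for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        if f1(g, [pred_a[i] for i in idx]) <= f1(g, [pred_b[i] for i in idx]):
            not_better += 1
    return not_better / n_samples
```

A returned value below 0.05 would be read as system A being significantly better than system B at that level.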
While differences across different dataset variants are not directly comparable, there is a tendency for models to perform much better in the post-hoc scenario than in the preemptive scenario, which is expected. Preemptive models are, however, able to beat the random baseline and achieve scores that are numerically similar to those of Zhang et al. (2018a) on their manually labeled data.
The BiLSTM-context model performs similarly to or worse than the BiLSTM model in all cases except the preemptive real-context case, where context does help, but both LSTM-based models are outperformed by a simple SVM. This indicates that the additional information provided by the thread context is, in this case, not very useful for determining the correct class. Manual inspection of the dataset confirms that humans could determine the toxicity of most comments without referring to the thread for additional context. This intuition is invalid in cases when the thread context for preemptive detection already contains a toxic comment. The presence of a toxic comment in a thread is a good indicator of a situation where more toxicity will follow. Thus considering the entire thread leads to better performance in such cases. This, however, is not true preemptive detection, as toxicity already occurred earlier in the thread. Consequently, we filtered out such cases from the data by requiring comments that are positive examples for the preemptive case to have no toxic ancestors (as described in Section 3). It is, however, worth mentioning that, for this reason, our preliminary experiments which omit this filtering step indeed showed more noticeable benefits of having the thread available in the preemptive case.
Table 4: Results in various settings and scenarios. Random is the expected performance of choosing a random class with uniform probability across the classes. The numbers are F1-scores on perfectly balanced test data.
We also note that the unbalanced nature of this data has a very negative effect on performance in a practical setting. For example, even after tuning the classification threshold to maximize F1 on unbalanced dev data, the F1-score for the best post-hoc model on the unbalanced test set is still below 0.5. Thus, more work is required to make models for this task that are applicable in a real-world setting.
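Threshold tuning of this kind can be sketched as a simple sweep over the scores observed on the dev set. The function names and the choice of candidate thresholds (every distinct dev score) are assumptions for illustration; the selected threshold would then be applied unchanged to the test set.

```python
def f1_at_threshold(scores, gold, threshold):
    """F1 of the positive class when predicting 1 for scores >= threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(dev_scores, dev_gold):
    """Return (threshold, F1) maximizing F1 on the dev set.

    Candidates are the distinct dev scores; ties go to the lowest threshold.
    """
    candidates = sorted(set(dev_scores))
    return max(((t, f1_at_threshold(dev_scores, dev_gold, t)) for t in candidates),
               key=lambda pair: pair[1])
```

On heavily skewed data the selected threshold tends to be high, trading recall for precision, since even a small false-positive rate on the negative class produces many more false alarms than there are true positives.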

Conclusion
We compiled a large semi-automatically labeled dataset for studying preemptive toxic language detection in Wikipedia conversations. We explored two types of deep learning models for this task: those that only consider a single comment and those that take into account the context by considering preceding comments in a conversation. In our experiments, the context-sensitive models did not significantly outperform context-agnostic ones. While all preemptive models would beat a random baseline, their performance is still too low for practical applications.
There are numerous possibilities for future work. One is to employ more sophisticated graph-based deep learning methods such as GraphSAGE (Hamilton et al., 2017). Another direction would be exploring ways to better address the class imbalance typical for this task. Yet another possibility would be to enrich the input features with information available about the user who is commenting, e.g., whether they had toxic comments in the past, or their personality profile derived from text using models such as that of Gjurković and Šnajder (2018). Finally, combining deep learning with discourse and pragmatic features, such as those of Zhang et al. (2018a), might be a good next step to improve the models in the preemptive setting.