Improved Abusive Comment Moderation with User Embeddings

Experimenting with a dataset of approximately 1.6M user comments from a Greek sports news portal, we explore how a state-of-the-art RNN-based moderation method can be improved by adding user embeddings, user type embeddings, user biases, or user type biases. We observe improvements in all cases, with user embeddings leading to the biggest performance gains.


Introduction
News portals often allow their readers to comment on articles, in order to get feedback, engage their readers, and build customer loyalty. User comments, however, can also be abusive (e.g., bullying, profanity, hate speech), damaging the reputation of news portals, making them liable to fines (e.g., when hosting comments encouraging illegal actions), and putting off readers. Large news portals often employ moderators, who are frequently overwhelmed by the volume and abusiveness of comments. [1] Readers are disappointed when non-abusive comments do not appear quickly online because of moderation delays. Smaller news portals may be unable to employ moderators, and some are forced to shut down their comments sections. [2]

In previous work (Pavlopoulos et al., 2017a), we introduced a new dataset of approximately 1.6M manually moderated user comments from a Greek sports news portal, called Gazzetta, which we made publicly available. [3] Experimenting on that dataset and the datasets of Wulczyn et al. (2017), which contain moderated English Wikipedia comments, we showed that a method based on a Recurrent Neural Network (RNN) outperforms DETOX (Wulczyn et al., 2017), the previous state of the art in automatic user content moderation. [4]

Our previous work, however, considered only the texts of the comments, ignoring user-specific information (e.g., the number of previously accepted or rejected comments of each user). Here we add user embeddings or user type embeddings to our RNN-based method, i.e., dense vectors that represent individual users or user types, similarly to word embeddings that represent words (Mikolov et al., 2013; Pennington et al., 2014). Experiments on Gazzetta comments show that both user embeddings and user type embeddings improve the performance of our RNN-based method, with user embeddings helping more. User-specific or user-type-specific scalar biases also help, to a lesser extent.

Dataset
We first discuss the dataset we used, to help acquaint the reader with the problem.
The dataset contains Greek comments from Gazzetta (Pavlopoulos et al., 2017a). There are approximately 1.45M training comments (covering Jan. 1, 2015 to Oct. 6, 2016); we call them G-TRAIN (Table 1). An additional set of 60,900 comments (Oct. 7 to Nov. 11, 2016) was split into a development set (G-DEV, 29,700 comments) and a test set (G-TEST, 29,700 comments). [5] Each comment has a gold label ('accept', 'reject'). The user ID of the author of each comment is also available, but user IDs were not used in our previous work.
When experimenting with user type embeddings or biases, we group the users into four types, based on their training comments: 'green' users, whose comments were frequently accepted; 'red' users, whose comments were frequently rejected; 'yellow' users, in between; and 'unknown' users, with too few training comments (T(u) ≤ 10) to reliably estimate a rejection rate.

[4] Two of the co-authors of Wulczyn et al. (2017) are with Jigsaw, who recently announced Perspective, a system to detect toxic comments. Perspective is not the same as DETOX (personal communication), but we were unable to obtain scientific articles describing it.

[5] The remaining 1,500 comments are not used here. Smaller subsets of G-TRAIN and G-TEST are also available (Pavlopoulos et al., 2017a), but are not used in this paper. The Wikipedia comment datasets of Wulczyn et al. (2017) cannot be used here, because they do not provide user IDs.
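The user-type grouping (green, yellow, red, unknown) can be sketched as follows. The 0.33/0.66 rejection-rate cut-offs below are illustrative assumptions of ours, not the paper's actual values; only the T(u) > 10 threshold for known users appears in the text.

```python
def user_type(num_comments, rejection_rate):
    """Assign a user type from T(u) (num_comments) and R(u) (rejection_rate).

    The 0.33 / 0.66 cut-offs are illustrative assumptions, not the paper's values.
    """
    if num_comments <= 10:
        return "unknown"   # too few training comments to estimate R(u)
    if rejection_rate < 0.33:
        return "green"     # frequently accepted
    if rejection_rate < 0.66:
        return "yellow"    # in between
    return "red"           # frequently rejected
```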

Methods
RNN: This is the RNN-based method of our previous work (Pavlopoulos et al., 2017a). It is a chain of GRU cells (Cho et al., 2014) that transforms the tokens w_1, ..., w_k of each comment to the hidden states h_1, ..., h_k (h_i ∈ R^m). Once h_k has been computed, a logistic regression (LR) layer estimates the probability that comment c should be rejected:

P(reject|c) = σ(W_h h_k + b)    (1)

where σ is the sigmoid function, W_h ∈ R^{1×m}, and b ∈ R.

ueRNN: This is the RNN-based method with user embeddings added. Each user u of the training set with T(u) > 10 training comments is mapped to a user-specific embedding v_u ∈ R^d; T(u) denotes the number of training comments of u. Users with T(u) ≤ 10 are mapped to a single 'unknown' user embedding. The LR layer is modified as follows, where v_u is the embedding of the author of c, and W_v ∈ R^{1×d}:

P(reject|c) = σ(W_h h_k + W_v v_u + b)    (2)
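Concretely, the modified LR layer of ueRNN can be sketched in a few lines of numpy. The function and variable names are ours, and the weights are treated as plain vectors rather than 1×m and 1×d matrices; the GRU encoder is assumed to have already produced h_k.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_reject(h_k, v_u, W_h, W_v, b):
    """ueRNN LR layer: P(reject|c) = sigma(W_h h_k + W_v v_u + b).

    h_k : (m,) last hidden state of the GRU chain
    v_u : (d,) embedding of the comment's author
    W_h : (m,) and W_v : (d,) weight vectors; b : scalar bias
    """
    return sigmoid(W_h @ h_k + W_v @ v_u + b)
```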
teRNN: This is the RNN-based method with user type embeddings added. Each user type t is mapped to a user type embedding v_t ∈ R^d. The LR layer is modified as in ueRNN, with v_t, the embedding of the type of the author of c, in place of v_u:

P(reject|c) = σ(W_h h_k + W_v v_t + b)    (3)

ubRNN: This is the RNN-based method with user biases added. Each user u of the training set with T(u) > 10 is mapped to a user-specific bias b_u ∈ R. Users with T(u) ≤ 10 are mapped to a single 'unknown' user bias. The LR layer is modified as follows, where b_u is the bias of the author of c:

P(reject|c) = σ(W_h h_k + b_u + b)    (4)
We expected ubRNN to learn higher (or lower) b_u biases for users whose posts were frequently rejected (accepted) in the training data, biasing the system towards rejecting (accepting) their posts.

tbRNN: This is the RNN-based method with user type biases. Each user type t is mapped to a user type bias b_t ∈ R. The LR layer is modified as follows, where b_t is the bias of the type of the author of c:

P(reject|c) = σ(W_h h_k + b_t + b)    (5)
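A sketch of how the bias variants could be implemented, with our own (hypothetical) names; the lookup mirrors the 'unknown' fallback described above for rare users.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def user_bias(biases, counts, user_id, min_comments=10):
    """Return b_u for the author, falling back to the shared 'unknown' bias
    when the user has T(u) <= min_comments training comments."""
    if counts.get(user_id, 0) > min_comments:
        return biases.get(user_id, biases["<unk>"])
    return biases["<unk>"]

def p_reject_bias(h_k, W_h, b, b_u):
    """ubRNN / tbRNN LR layer: sigma(W_h h_k + b_u + b),
    with b_u replaced by the type bias b_t in tbRNN."""
    return sigmoid(W_h @ h_k + b_u + b)
```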
We expected tbRNN to learn a higher b_t for the red user type (frequently rejected), and a lower b_t for the green user type (frequently accepted), with the biases of the other two types in between.

In all methods above, we use 300-dimensional word embeddings, user and user type embeddings with d = 300 dimensions, and m = 128 hidden units in the GRU cells, as in our previous experiments (Pavlopoulos et al., 2017a), where we tuned all hyper-parameters on 2% held-out training comments. Early stopping evaluates on the same held-out subset. User and user type embeddings are randomly initialized and updated by backpropagation. Word embeddings are initialized to the WORD2VEC embeddings of our previous work (Pavlopoulos et al., 2017a), which were pretrained on 5.2M Gazzetta comments. Out-of-vocabulary words, i.e., words not encountered or encountered only once in the training set and/or words with no initial embeddings, are mapped (during both training and testing) to a single randomly initialized word embedding, updated by backpropagation. We use Glorot initialization (Glorot and Bengio, 2010) for other parameters, cross-entropy loss, and Adam (Kingma and Ba, 2015). [7]

uBASE: For a comment c authored by user u, this baseline returns the rejection rate R(u) of the author's training comments, if there are T(u) > 10 training comments of u, and 0.5 otherwise.
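uBASE follows directly from its definition; a minimal sketch (names ours), with accepted/rejected training counts kept in dictionaries:

```python
def ubase_score(user_id, accepted, rejected, min_comments=10):
    """uBASE: rejection rate R(u) of the author's training comments,
    or 0.5 when the author has T(u) <= min_comments training comments.

    accepted / rejected map user ids to counts in the training set.
    """
    t_u = accepted.get(user_id, 0) + rejected.get(user_id, 0)
    if t_u > min_comments:
        return rejected.get(user_id, 0) / t_u  # R(u)
    return 0.5                                 # unseen or rare user
```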
tBASE: This baseline returns a fixed rejection probability that depends on the user type t of the author.


Results and Discussion

Table 3 shows the AUC scores (area under ROC curve) of the methods considered. Using AUC allows us to compare directly to the results of our previous work (Pavlopoulos et al., 2017a) and the work of Wulczyn et al. (2017). Also, AUC considers performance at multiple classification thresholds t (rejecting comment c when P(reject|c) ≥ t, for different t values), which gives a more complete picture than reporting precision, recall, or F-scores at a particular t only. Accuracy is not an appropriate measure here, because of class imbalance (Table 1). For methods that involve random initializations (all but the baselines), the results are averaged over three repetitions; we also report the standard error across the repetitions.

User-specific information always improves our original RNN-based method (Table 3), but the best results are obtained by adding user embeddings (ueRNN). Figure 1 visualizes the user embeddings learned by ueRNN. The two dimensions of Fig. 1 correspond to the two principal components of the user embeddings, obtained via PCA. The colors and numeric labels reflect the rejection rates R(u) of the corresponding users. Moving from left to right in Fig. 1, the rejection rate increases, indicating that the user embeddings of ueRNN capture mostly the rejection rate R(u). This rate (a single scalar value per user) can also be captured by the simpler user-specific biases of ubRNN, which explains why ubRNN also performs well (second best results in Table 3). Nevertheless, ueRNN performs better than ubRNN, suggesting that user embeddings capture more information than just a user-specific rejection rate bias. [8] Three of the user types (Red, Yellow, Green) in effect also measure R(u), but in discretized form (three bins), which also explains why user type embeddings (teRNN) perform well (third best method).

The performance of tbRNN is close to that of teRNN, suggesting again that most of the information captured by user type embeddings can also be captured by simpler scalar user-type-specific biases. The user type biases b_t learned by tbRNN are shown in Table 4. The bias of the Red type is the largest, the bias of the Green type is the smallest, and the biases of the Unknown and Yellow types are in between, as expected (see Methods). The same observations hold for the average user-specific biases b_u learned by ubRNN (Table 4).

Overall, Table 3 indicates that user-specific information (ueRNN, ubRNN) is better than user-type information (teRNN, tbRNN), and that embeddings (ueRNN, teRNN) are better than scalar biases (ubRNN, tbRNN), though the differences are small. All the RNN-based methods outperform the two baselines (uBASE, tBASE), which do not consider the texts of the comments.
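For reference, the AUC used above can be computed directly from its rank interpretation, without any library; a minimal sketch (not the paper's evaluation code):

```python
def auc(labels, scores):
    """Area under the ROC curve, via its rank interpretation: the probability
    that a random gold-'reject' comment (label 1) scores higher than a random
    gold-'accept' one (label 0); ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```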
Let us provide a couple of examples to illustrate the role of user-specific information. We encountered a comment saying just "Ooooh, down to Piraeus. . . " (translated from Greek), which the moderator had rejected, because it is the beginning of an abusive slogan. The rejection probability of RNN was only 0.34, presumably because there are no clearly abusive expressions in the comment, but the rejection probability of ueRNN was 0.72, because the author had a very high rejection rate.
On the other hand, another comment said "Indeed, I know nothing about the filth of Greek soccer." (translated, apparently not a sarcastic comment). The original RNN method marginally rejected the comment (rejection probability 0.57), presumably because of the 'filth' (comments talking about the filth of some sport or championship are often rejected), but ueRNN gave it a very low rejection probability (0.15), because the author of the comment had a very low rejection rate.

Related work
In previous work (Pavlopoulos et al., 2017a), we showed that our RNN-based method outperforms DETOX (Wulczyn et al., 2017), the previous state of the art in user content moderation. DETOX uses character or word n-gram features, no user-specific information, and an LR or MLP classifier. Other related work on abusive content moderation was reviewed extensively in our previous work (Pavlopoulos et al., 2017a). Here we focus on previous work that considered user-specific features and user embeddings.

Dadvar et al. (2013) detect cyberbullying in YouTube comments, using an SVM and features examining the content of each comment (e.g., second person pronouns followed by profane words, common bullying words), but also the profile and history of the author of the comment (e.g., age, frequency of profane words in past posts). Waseem et al. (2016) detect hate speech tweets. Their best method is an LR classifier, with character n-grams and a feature indicating the gender of the author; adding the location of the author did not help.

Cheng et al. (2015) predict which users will be banned from on-line communities. Their best system uses a Random Forest or LR classifier, with features examining the average readability and sentiment of each user's past posts, the past activity of each user (e.g., number of posts daily, proportion of posts that are replies), and the reactions of the community to the past actions of each user (e.g., up-votes, number of posts rejected). Lee et al. (2014) and Napoles et al. (2017) include similar user-specific features in classifiers intended to detect high quality on-line discussions.

Amir et al. (2016) detect sarcasm in tweets. Their best system uses a word-based Convolutional Neural Network (CNN). The feature vector produced by the CNN (representing the content of the tweet) is concatenated with the user embedding of the author, and passed on to an MLP that classifies the tweet as sarcastic or not.
This method outperforms a previous state-of-the-art sarcasm detection method (Bamman and Smith, 2015) that relies on an LR classifier with handcrafted content and user-specific features. We use an RNN instead of a CNN, and we feed the comment and user embeddings to a simpler LR layer (Eq. 2), instead of an MLP. Unlike our experiments, Amir et al. discard unknown users; they also consider only sarcasm, whereas moderation also involves profanity, hate speech, bullying, threats, etc.
User embeddings have also been used in: conversational agents (Li et al., 2016); sentiment analysis (Chen et al., 2016); retweet prediction (Zhang et al., 2016); predicting which topics a user is likely to tweet about, the accounts a user may want to follow, and the age, gender, political affiliation of Twitter users (Benton et al., 2016).
Our previous work (Pavlopoulos et al., 2017a) also discussed how machine learning can be used in semi-automatic moderation, by letting moderators focus on 'difficult' comments and automatically handling comments that are easier to accept or reject. In more recent work (Pavlopoulos et al., 2017b) we also explored how an attention mechanism can be used to highlight possibly abusive words or phrases when showing 'difficult' comments to moderators.

Conclusions
Experimenting with a dataset of approximately 1.6M user comments from a Greek sports news portal, we explored how a state-of-the-art RNN-based moderation method can be improved by adding user embeddings, user type embeddings, user biases, or user type biases. We observed improvements in all cases, with user embeddings helping most.
We plan to compare ueRNN to CNN-based methods that employ user embeddings (Amir et al., 2016), after replacing the LR layer of ueRNN by an MLP to allow non-linear combinations of comment and user embeddings.