Deep Learning for User Comment Moderation

Experimenting with a new dataset of 1.6M user comments from a Greek news portal and existing datasets of English Wikipedia comments, we show that an RNN outperforms the previous state of the art in moderation. A deep, classification-specific attention mechanism further improves the overall performance of the RNN. We also compare against a CNN and a word-list baseline, considering both fully automatic and semi-automatic moderation.


Introduction
User comments play a central role in social media and online discussion fora. News portals and blogs often also allow their readers to comment in order to get feedback, engage their readers, and build customer loyalty. User comments, however, and more generally user content can also be abusive (e.g., bullying, profanity, hate speech). Social media are increasingly under pressure to combat abusive content. News portals also suffer from abusive user comments, which damage their reputation and make them liable to fines, e.g., when hosting comments encouraging illegal actions. They often employ moderators, who are frequently overwhelmed by the volume of comments. Readers are disappointed when non-abusive comments do not appear quickly online because of moderation delays. Smaller news portals may be unable to employ moderators, and some are forced to shut down their comments sections entirely. 1
We examine how deep learning (Goodfellow et al., 2016; Goldberg, 2016) can be used to moderate user comments. We experiment with a new dataset of approx. 1.6M manually moderated user comments from a Greek sports portal (Gazzetta), which we make publicly available. 2 Furthermore, we provide word embeddings pre-trained on 5.2M comments from the same portal. We also experiment on the datasets of Wulczyn et al. (2017), which contain English Wikipedia comments labeled for personal attacks, aggression, and toxicity.
1 See, for example, http://niemanreports.org/articles/the-future-of-comments/.
In a fully automatic scenario, a system directly accepts or rejects comments. Although this scenario may be the only available one, e.g., when portals cannot afford moderators, it is unrealistic to expect that fully automatic moderation will be perfect, because abusive comments may involve irony, sarcasm, harassment without profanity etc., which are particularly difficult for machines to handle. When moderators are available, it is more realistic to develop semi-automatic systems to assist rather than replace them, a scenario that has not been considered in previous work. Comments for which the system is uncertain (Fig. 1) are shown to a moderator to decide; all other comments are accepted or rejected by the system. We discuss how moderation systems can be tuned, depending on the availability and workload of moderators. We also introduce additional evaluation measures for the semi-automatic scenario.
On both Gazzetta and Wikipedia comments and for both scenarios (automatic, semi-automatic), we show that a recurrent neural network (RNN) outperforms the system of Wulczyn et al. (2017), the previous state of the art for comment moderation, which employed logistic regression (LR) or a multi-layer perceptron (MLP). We also propose an attention mechanism that improves the overall performance of the RNN. Our attention differs from most previous ones (Bahdanau et al., 2015; Luong et al., 2015) in that it is used in text classification, where there is no previously generated output subsequence to drive the attention, unlike sequence-to-sequence models (Sutskever et al., 2014). In effect, our attention mechanism detects the words of a comment that most affect the classification decision (accept, reject), by examining them in the context of the particular comment.
Our main contributions are: (i) We release a new dataset of 1.6M moderated user comments. (ii) We are among the first to apply deep learning to user comment moderation, and we show that an RNN with a novel classification-specific attention mechanism outperforms the previous state of the art. (iii) Unlike previous work, we also consider a semi-automatic scenario, along with threshold tuning and evaluation measures for it.

Datasets
We first discuss the datasets we used, to help acquaint the reader with the problem.

Gazzetta dataset
There are approx. 1.45M training comments (covering Jan. 1, 2015 to Oct. 6, 2016) in the Gazzetta dataset; we call them G-TRAIN-L (Table 1). Some experiments use only the first 100K comments of G-TRAIN-L, called G-TRAIN-S. An additional set of 60,900 comments (Oct. 7 to Nov. 11, 2016) was split into development (G-DEV, 29,700 comments), large test (G-TEST-L, 29,700), and small test (G-TEST-S, 1,500) sets. Gazzetta's moderators (2 full-time, plus journalists occasionally helping) are occasionally instructed to be stricter (e.g., during violent events). To get a more accurate view of performance in normal situations, we manually re-moderated (labeled as 'accept' or 'reject') the comments of G-TEST-S, producing G-TEST-S-R. The reject ratio is approximately 30% in all subsets, except for G-TEST-S-R where it drops to 22%, because there are no occasions where the moderators were instructed to be stricter in G-TEST-S-R.
Each G-TEST-S-R comment was re-moderated by 5 annotators. Krippendorff's (2004) alpha was 0.4762, close to the value (0.45) reported by Wulczyn et al. (2017) for Wikipedia comments. Using Cohen's Kappa (Cohen, 1960), the mean pairwise agreement was 0.4749. The mean pairwise percentage of agreement (% of comments each pair of annotators agreed on) was 81.33%. Cohen's Kappa and Krippendorff's alpha lead to moderate scores, because they account for agreement by chance, which is high when there is class imbalance (22% reject, 78% accept in G-TEST-S-R).
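The gap between a high percentage of agreement and a moderate chance-corrected score can be illustrated with a small worked example. The sketch below computes Cohen's Kappa for two hypothetical annotators who each reject 22% of 100 comments; the counts are invented for illustration and are not taken from G-TEST-S-R.

```python
def cohen_kappa(a, b):
    """Cohen's Kappa for two annotators' label lists of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Invented toy data: 12 comments rejected by both annotators, 10 only by A,
# 10 only by B, and 68 accepted by both (each annotator rejects 22 of 100).
ann_a = ["reject"] * 12 + ["reject"] * 10 + ["accept"] * 10 + ["accept"] * 68
ann_b = ["reject"] * 12 + ["accept"] * 10 + ["reject"] * 10 + ["accept"] * 68

agreement = sum(x == y for x, y in zip(ann_a, ann_b)) / len(ann_a)
kappa = cohen_kappa(ann_a, ann_b)
print(f"percentage agreement = {agreement:.2f}, kappa = {kappa:.3f}")
```

With these toy counts the annotators agree on 80% of the comments, yet chance agreement is already about 66% because most comments are accepted, so Kappa is only about 0.42.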
We also provide 300-dimensional word embeddings, pre-trained on approx. 5.2M comments (268M tokens) from Gazzetta using WORD2VEC (Mikolov et al., 2013a,b). 3 This larger dataset cannot be used to train classifiers, because most of its comments are from a period (before 2015) when Gazzetta did not employ moderators.

Wikipedia datasets
Wulczyn et al. (2017) created three datasets containing English Wikipedia talk page comments.
Attacks dataset: This dataset contains approx. 115K comments, which were labeled as personal attacks (reject) or not (accept) using crowdsourcing. Each comment was labeled by at least 10 annotators. Inter-annotator agreement, measured on a random sample of 1K comments using Krippendorff's (2004) alpha, was 0.45. The gold label of each comment is determined by the majority of annotators, leading to binary labels (accept, reject). Alternatively, the gold label is the percentage of annotators that labeled the comment as 'accept' (or 'reject'), leading to probabilistic labels. 4 The dataset is split in three parts (Table 1): training (W-ATT-TRAIN, 69,526 comments), development (W-ATT-DEV, 23,160 comments), and test (W-ATT-TEST, 23,178 comments). In all three parts, the rejected comments are 12%, but this ratio is artificial (in effect, Wulczyn et al. oversampled comments posted by banned users), unlike the Gazzetta subsets, where the truly observed accept/reject ratios are used.
Toxicity dataset: This dataset was created like the previous one, but contains more comments (159,686), now labeled as toxic (reject) or not (accept). Inter-annotator agreement was not reported. Again, binary or probabilistic gold labels can be used. The dataset is split in three parts (Table 1): training (W-TOX-TRAIN, 95,692 comments), development (W-TOX-DEV, 32,128 comments), and test (W-TOX-TEST, 31,866 comments). In all three parts, the rejected (toxic) comments are 10%, again an artificial ratio.
Wikipedia comments are longer (median 38 and 39 tokens for attacks, toxicity) compared to Gazzetta's (median 25). Wulczyn et al. (2017) also created an 'aggression' dataset containing the same comments as the personal attacks one, but now labeled as aggressive or not. The (probabilistic) labels of the two datasets are very highly correlated (0.8992 Spearman, 0.9718 Pearson) and we do not consider the aggression dataset further.

Methods
We experimented with an RNN operating on word embeddings, the same RNN enhanced with our attention mechanism (a-RNN), several variants of a-RNN, a vanilla convolutional neural network (CNN) also operating on word embeddings, the DETOX system of Wulczyn et al. (2017), and a baseline that uses word lists with precision scores.

DETOX
DETOX (Wulczyn et al., 2017) was the previous state of the art in comment moderation, in the sense that it had the best reported results on the Wikipedia datasets (Section 2.2), the largest previous publicly available datasets of moderated user comments. 5 DETOX represents each comment as a bag of word n-grams (n ≤ 2, each comment becomes a bag containing its 1-grams and 2-grams) or a bag of character n-grams (n ≤ 5, each comment becomes a bag containing character 1-grams, ..., 5-grams). DETOX can rely on a logistic regression (LR) or multi-layer perceptron (MLP) classifier, and use binary or probabilistic gold labels (Section 2.2) during training. We used the DETOX implementation of Wulczyn et al. and the same grid search to tune the hyper-parameters that select word or character n-grams, classifier (LR or MLP), and gold labels (binary or probabilistic). For Gazzetta, only binary gold labels were possible, since G-TRAIN-L and G-TRAIN-S have a single gold label per comment. Unlike Wulczyn et al., we tuned the hyper-parameters by evaluating (computing AUC and Spearman, Section 4) on a random 2% of held-out comments of W-ATT-TRAIN, W-TOX-TRAIN, or G-TRAIN-S, instead of the development subsets, to be able to obtain more realistic results from the development sets while developing the methods. The tuning always selected character n-grams, as in the work of Wulczyn et al., and preferred LR to MLP, whereas Wulczyn et al. reported slightly higher performance for the MLP on W-ATT-DEV. 6 The tuning also selected probabilistic labels when available (Wikipedia datasets), as in the work of Wulczyn et al.
4 We also construct probabilistic gold labels (in addition to binary ones) for G-TEST-S-R, where there are 5 annotators.
5 Two of the co-authors of Wulczyn et al. (2017) are with Jigsaw, who recently announced Perspective, a system to detect 'toxic' comments. Perspective is not the same as DETOX (personal communication), but we were unable to obtain scientific articles describing it. We have applied for access to its API (http://www.perspectiveapi.com/).
6 Wulczyn et al. (2017) report results only on W-ATT-DEV. We repeated the tuning by evaluating on W-ATT-DEV, and again character n-grams with LR were selected.

RNN:
The RNN method is a chain of GRU cells (Cho et al., 2014) that transforms the tokens $w_1, \dots, w_k$ of each comment to hidden states $h_1, \dots, h_k$, followed by an LR layer that uses $h_k$ to classify the comment (accept, reject). Formally, given the vocabulary $V$, a matrix $E \in \mathbb{R}^{d \times |V|}$ containing $d$-dimensional word embeddings, an initial $h_0$, and a comment $c = w_1, \dots, w_k$, the RNN computes $h_1, \dots, h_k$ as follows ($h_t \in \mathbb{R}^m$):
$$\begin{aligned}
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \\
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r)
\end{aligned}$$
where $\tilde{h}_t \in \mathbb{R}^m$ is the proposed hidden state at position $t$, obtained by considering the word embedding $x_t$ of token $w_t$ and the previous hidden state $h_{t-1}$; $\odot$ denotes element-wise multiplication; $r_t \in \mathbb{R}^m$ is the reset gate (for $r_t$ all zeros, it allows the RNN to forget the previous state $h_{t-1}$); $z_t \in \mathbb{R}^m$ is the update gate (for $z_t$ all zeros, it allows the RNN to ignore the new proposed $\tilde{h}_t$, hence also $x_t$, and copy $h_{t-1}$ as $h_t$); $\sigma$ is the sigmoid function; $W_h, W_z, W_r \in \mathbb{R}^{m \times d}$; $U_h, U_z, U_r \in \mathbb{R}^{m \times m}$; and $b_h, b_z, b_r \in \mathbb{R}^m$. Once $h_k$ has been computed, the LR layer estimates the probability that comment $c$ should be rejected, with $W_p \in \mathbb{R}^{1 \times m}$, $b_p \in \mathbb{R}$:
$$P_{\text{RNN}}(\text{reject}|c) = \sigma(W_p h_k + b_p)$$
a-RNN: When the attention mechanism is added, the LR layer considers the weighted sum $h_{sum}$ of all the hidden states, instead of just $h_k$ (Fig. 2):
$$h_{sum} = \sum_{t=1}^{k} a_t h_t \tag{1}$$
$$P_{\text{a-RNN}}(\text{reject}|c) = \sigma(W_p h_{sum} + b_p)$$
The weights $a_t$ are produced by an attention mechanism, which is an MLP with $l$ layers:
$$\begin{aligned}
a_t^{(1)} &= \mathrm{ReLU}(W^{(1)} h_t + b^{(1)}) \\
&\;\;\vdots \\
a_t^{(l-1)} &= \mathrm{ReLU}(W^{(l-1)} a_t^{(l-2)} + b^{(l-1)}) \\
a_t^{(l)} &= W^{(l)} a_t^{(l-1)} + b^{(l)} \\
a_t &= \mathrm{softmax}(a_t^{(l)}; a_1^{(l)}, \dots, a_k^{(l)})
\end{aligned} \tag{2}$$
where $a_t^{(1)}, \dots, a_t^{(l-1)} \in \mathbb{R}^r$, $a_t^{(l)}, a_t \in \mathbb{R}$, $W^{(1)} \in \mathbb{R}^{r \times m}$, $W^{(j)} \in \mathbb{R}^{r \times r}$ for $j = 2, \dots, l-1$, and $W^{(l)} \in \mathbb{R}^{1 \times r}$. The softmax operates across all the $a_t^{(l)}$ ($t = 1, \dots, k$), making the attention weights $a_t$ sum to 1. Our attention mechanism differs from most previous ones (Mnih et al., 2014; Bahdanau et al., 2015; Xu et al., 2015; Luong et al., 2015) in that it is used in a classification setting, where there is no previously generated output subsequence (e.g., partly generated translation) to drive the attention (e.g., assign more weight to source words to translate next), unlike seq2seq models (Sutskever et al., 2014). It assigns larger weights $a_t$ to hidden states $h_t$ corresponding to positions where there is more evidence that the comment should be accepted or rejected.
[Figure 2. Inputs: $x_1, x_2, \dots, x_k$. Outputs: acceptance probability, rejection probability.]
Yang et al. (2016) use a similar attention mechanism, but ours is deeper. In effect they always set $l = 2$, whereas we allow $l$ to be larger (tuning selects $l = 4$). 7 On the other hand, the attention mechanism of Yang et al. is part of a classification method for longer texts (e.g., product reviews). Their method uses two GRU RNNs, both bidirectional (Schuster and Paliwal, 1997), one turning the word embeddings of each sentence to a sentence embedding, and one turning the sentence embeddings to a document embedding, which is then fed to an LR layer. Yang et al. use their attention mechanism in both RNNs, to assign attention scores to words and sentences. We consider shorter texts (comments), we have a single RNN, and we assign attention scores to words only. 8
7 Yang et al. use tanh instead of ReLU in Eq. 2, which works worse in our case, and no bias $b^{(l)}$ in the $l$-th layer.
da-RNN: In a variant of a-RNN, called da-RNN (direct attention), the input to the first layer of the attention mechanism is the embedding $x_t$ of word $w_t$, rather than $h_t$ (cf. Eq. 2; $W^{(1,x)} \in \mathbb{R}^{r \times d}$):
$$a_t^{(1)} = \mathrm{ReLU}(W^{(1,x)} x_t + b^{(1)}) \tag{3}$$
Intuitively, the attention of a-RNN considers each word embedding $x_t$ in its (left) context, modelled by $h_t$, whereas the attention of da-RNN considers directly $x_t$ without its context, but $h_{sum}$ is still the weighted sum of the hidden states (Eq. 1).
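The attention pooling of a-RNN described above (an MLP scores each hidden state, a softmax normalizes the scores across positions, and the LR layer sees the weighted sum) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' Keras implementation: the hidden states are random stand-ins for GRU outputs, the weights are untrained, the sizes are toy values rather than m = r = 128, and the helper names (`affine`, `attention_weights`, `reject_probability`, `random_layer`) are hypothetical.

```python
import math
import random

def affine(W, b, v):
    """Return W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def attention_weights(states, layers):
    """Score every hidden state with an l-layer MLP and softmax across positions.

    layers is a list of (W, b) pairs; ReLU follows every layer except the last,
    whose single output is the unnormalized attention score of the position.
    """
    scores = []
    for h in states:
        a = h
        for W, b in layers[:-1]:
            a = [max(0.0, x) for x in affine(W, b, a)]  # ReLU
        W, b = layers[-1]
        scores.append(affine(W, b, a)[0])
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def reject_probability(states, layers, W_p, b_p):
    """Weighted sum of hidden states, followed by a logistic regression layer."""
    a = attention_weights(states, layers)
    m = len(states[0])
    h_sum = [sum(a_t * h[i] for a_t, h in zip(a, states)) for i in range(m)]
    logit = sum(w * x for w, x in zip(W_p, h_sum)) + b_p
    return 1.0 / (1.0 + math.exp(-logit)), a

def random_layer(rng, n_out, n_in):
    """Untrained toy weights with a zero bias (illustration only)."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
m, r, k = 4, 3, 5  # toy sizes: hidden size, attention size, comment length
states = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(k)]
layers = [random_layer(rng, r, m), random_layer(rng, r, r), random_layer(rng, 1, r)]
W_p = [rng.uniform(-0.5, 0.5) for _ in range(m)]
p_reject, weights = reject_probability(states, layers, W_p, 0.0)
print(round(sum(weights), 6), 0.0 < p_reject < 1.0)
```

Passing the word embeddings instead of the hidden states to `attention_weights`, while still summing hidden states, would correspond to the da-RNN variant.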

eq-RNN:
In another variant of a-RNN, called eq-RNN, we assign equal attention to all the hidden states. The feature vector of the LR layer is now the average $h_{sum} = \frac{1}{k} \sum_{t=1}^{k} h_t$ (cf. Eq. 1).
da-CENT: For ablation testing, we also experiment with a variant, called da-CENT, that does not use the hidden states of the RNN. The input to the attention mechanism is now directly the embedding $x_t$ instead of $h_t$ (as in da-RNN, Eq. 3), and $h_{sum}$ is the weighted average (centroid) of word embeddings $h_{sum} = \sum_{t=1}^{k} a_t x_t$ (cf. Eq. 1). 9
eq-CENT: For further ablation, we also experiment with eq-CENT, which uses neither the RNN nor the attention mechanism. The feature vector of the LR layer is now simply the average of word embeddings $h_{sum} = \frac{1}{k} \sum_{t=1}^{k} x_t$ (cf. Eq. 1).
We set $l = 4$, $d = 300$, $m = r = 128$, having tuned the hyper-parameters of RNN and a-RNN on the same 2% held-out training comments used to tune DETOX; da-RNN, eq-RNN, da-CENT, and eq-CENT use the same hyper-parameter values as a-RNN, to make their results more directly comparable and save time. We use Glorot initialization (Glorot and Bengio, 2010), cross-entropy loss, and Adam (Kingma and Ba, 2015). 10 Early stopping evaluates on the same held-out subsets. For Gazzetta, word embeddings are initialized to the WORD2VEC embeddings we provide (Section 2.1). For the Wikipedia datasets, they are initialized to GLOVE embeddings (Pennington et al., 2014). 11 In both cases, the embeddings are updated during backpropagation. Out of vocabulary (OOV) words, meaning words not encountered in the training set and/or words we have no initial embeddings for, are mapped (during training and testing) to a single randomly initialized embedding, which is also updated during training. 12
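The OOV handling described above can be sketched as a vocabulary-building step. This is a minimal illustration, assuming tokenized comments and a set of words that have pre-trained embeddings; the `OOV` token name and the helpers `build_vocab` and `encode` are hypothetical. Setting `min_count=2` mimics the Gazzetta setting, where words seen only once in the training set are also treated as OOV.

```python
from collections import Counter

OOV = "<OOV>"  # hypothetical name for the single shared OOV entry

def build_vocab(train_comments, pretrained_words, min_count=1):
    """Map each kept word to an index; index 0 is reserved for OOV.

    A word is kept only if it occurs at least min_count times in the
    training comments and has an initial (pre-trained) embedding.
    """
    counts = Counter(tok for comment in train_comments for tok in comment)
    vocab = {OOV: 0}
    for tok, n in counts.items():
        if n >= min_count and tok in pretrained_words:
            vocab[tok] = len(vocab)
    return vocab

def encode(comment, vocab):
    """Replace every token with its index, using the OOV index as fallback."""
    return [vocab.get(tok, vocab[OOV]) for tok in comment]

# Invented toy data for illustration.
train = [["good", "game"], ["good", "match"]]
pretrained = {"good", "game", "match"}

vocab_wiki = build_vocab(train, pretrained, min_count=1)
vocab_gazzetta = build_vocab(train, pretrained, min_count=2)
print(encode(["good", "referee", "game"], vocab_wiki))      # [1, 0, 2]
print(encode(["good", "referee", "game"], vocab_gazzetta))  # [1, 0, 0]
```

Row 0 of the embedding matrix would then hold the single randomly initialized OOV embedding, which is updated during training like any other row.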

CNN
We also compare against a vanilla CNN operating on word embeddings. We describe the CNN only briefly, because it is very similar to that of Kim (2014); see also Goldberg (2016) for an introduction to CNNs, and Zhang and Wallace (2015).
For Wikipedia comments, we use a 'narrow' convolution layer, with kernels sliding (stride 1) over (entire) embeddings of word n-grams of sizes n = 1, ..., 4. We use 300 kernels for each n value, a total of 1,200 kernels. The outputs of each kernel, obtained by applying the kernel to the different n-grams of a comment $c$, are then max-pooled, leading to a single output per kernel. The resulting feature vector (1,200 max-pooled outputs) goes through a dropout layer (Hinton et al., 2012) (p = 0.5), and then to an LR layer, which provides $P_{\text{CNN}}(\text{reject}|c)$. For Gazzetta, the CNN is the same, except that n = 1, ..., 5, leading to 1,500 features per comment. All hyper-parameters were tuned on the 2% held-out training comments used to tune the other methods. Again, we use 300-dimensional word embeddings, which are now randomly initialized, since tuning indicated this was better than initializing to pre-trained embeddings. OOV words are treated as in the RNN-based methods. All embeddings are updated. Early stopping evaluates on the held-out subsets. Again, we use Glorot initialization, cross-entropy loss, and Adam. 13
9 We also tried tf-idf scores in the $h_{sum}$ of da-CENT, instead of attention scores, but preliminary results were poor.
10 We used Keras (http://keras.io/) with the TensorFlow back-end (http://www.tensorflow.org/).
11 See https://nlp.stanford.edu/projects/glove/. We use 'Common Crawl' (840B tokens).
12 For Gazzetta, words encountered only once in the training set (G-TRAIN-L or G-TRAIN-S) are also treated as OOV.
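The narrow convolution and max-pooling step of the CNN can be sketched as follows. This is an illustration under assumptions, not the authors' TensorFlow code: the ReLU activation and the handling of comments shorter than n (a pooled output of 0.0) are choices made here that the text does not specify, the weights are untrained, and the sizes are toy values (3 kernels per n instead of 300).

```python
import random

def conv_max_pool(embeddings, kernels_by_n):
    """Apply each kernel to every word n-gram (stride 1, no padding), then max-pool.

    embeddings: k word embeddings, each a list of d floats.
    kernels_by_n: {n: [(weights of length n * d, bias), ...]}.
    Returns one max-pooled output per kernel.
    """
    k = len(embeddings)
    features = []
    for n, kernels in sorted(kernels_by_n.items()):
        # Each window concatenates the embeddings of n consecutive words.
        windows = [sum(embeddings[i:i + n], []) for i in range(k - n + 1)]
        for weights, bias in kernels:
            outputs = [
                max(0.0, sum(w * x for w, x in zip(weights, window)) + bias)  # ReLU (assumed)
                for window in windows
            ]
            # Assumption: a comment shorter than n contributes 0.0 for this kernel.
            features.append(max(outputs) if outputs else 0.0)
    return features

rng = random.Random(0)
d, k, kernels_per_n = 2, 6, 3  # toy sizes
embeddings = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(k)]
kernels_by_n = {
    n: [([rng.uniform(-1, 1) for _ in range(n * d)], 0.0) for _ in range(kernels_per_n)]
    for n in range(1, 5)  # n = 1, ..., 4 as for the Wikipedia comments
}
features = conv_max_pool(embeddings, kernels_by_n)
print(len(features))  # 4 n-gram sizes x 3 kernels = 12
```

The resulting feature vector would then pass through dropout and the LR layer, which are omitted here.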

LIST baseline
A baseline, called LIST, collects every word $w$ that occurs in more than 10 (for W-ATT-TRAIN, W-TOX-TRAIN, G-TRAIN-S) or 100 comments (for G-TRAIN-L) in the training set, along with the precision of $w$, i.e., the number of rejected training comments containing $w$ divided by the total number of training comments containing $w$. The resulting lists contain 10,423, 11,360, 16,864, and 21,940 word types, when using W-ATT-TRAIN, W-TOX-TRAIN, G-TRAIN-S, G-TRAIN-L, respectively. For a comment $c$, $P_{\text{LIST}}(\text{reject}|c)$ is the maximum precision of all the words in $c$.
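The LIST baseline is simple enough to sketch in a few lines. The code below is an illustration, assuming tokenized comments with 'accept'/'reject' labels; the function names are hypothetical, and returning 0.0 for a comment that contains no listed word is an assumption, since the text does not cover that case.

```python
from collections import Counter

def build_list(comments, labels, min_comments=10):
    """Word -> precision, for words occurring in more than min_comments comments."""
    total, rejected = Counter(), Counter()
    for tokens, label in zip(comments, labels):
        for w in set(tokens):  # count each word once per comment
            total[w] += 1
            if label == "reject":
                rejected[w] += 1
    return {w: rejected[w] / n for w, n in total.items() if n > min_comments}

def p_list_reject(tokens, precision):
    """Maximum precision over the listed words of the comment (0.0 if none, assumed)."""
    return max((precision[w] for w in tokens if w in precision), default=0.0)

# Invented toy data; min_comments is lowered to 1 so that the example is small.
comments = [["you", "idiot"], ["nice", "goal"], ["idiot", "ref"], ["nice", "you"]]
labels = ["reject", "accept", "reject", "accept"]
precision = build_list(comments, labels, min_comments=1)

print(precision)                                   # you: 0.5, idiot: 1.0, nice: 0.0
print(p_list_reject(["idiot", "goal"], precision)) # 1.0
```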

Tuning thresholds
All methods produce a $p = P(\text{reject}|c)$ per comment $c$. In semi-automatic moderation (Fig. 1), a comment is directly rejected if its $p$ is above a rejection threshold $t_r$, it is directly accepted if $p$ is below an acceptance threshold $t_a$, and it is shown to a moderator if $t_a \le p \le t_r$ (gray zone of Fig. 3).
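The three-way decision rule can be written directly. The sketch below uses hypothetical function names and invented probabilities and thresholds, and also computes the fraction of comments that the system decides on its own.

```python
def route(p_reject, t_a, t_r):
    """Return 'accept', 'reject', or 'moderator' for a single comment."""
    if p_reject > t_r:
        return "reject"
    if p_reject < t_a:
        return "accept"
    return "moderator"  # gray zone: t_a <= p <= t_r

def automatic_fraction(probs, t_a, t_r):
    """Fraction of comments handled without a moderator."""
    decided = sum(route(p, t_a, t_r) != "moderator" for p in probs)
    return decided / len(probs)

# Invented example probabilities and thresholds.
probs = [0.02, 0.10, 0.35, 0.50, 0.65, 0.90, 0.97, 0.15, 0.05, 0.80]
decisions = [route(p, t_a=0.3, t_r=0.7) for p in probs]
print(decisions.count("moderator"), automatic_fraction(probs, 0.3, 0.7))  # 3 0.7
```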
In our experience, moderators (or their employers) can easily specify the approximate percentage of comments they can afford to check manually (e.g., 20% daily) or, equivalently, the approximate percentage of comments the system should handle automatically. We call coverage the latter percentage; hence, 1 − coverage is the approximate percentage of comments to be checked manually. By contrast, moderators are baffled when asked to tune $t_r$ and $t_a$ directly. Consequently, we ask them to specify the approximate desired coverage. We then sort the comments of the development set (G-DEV, W-ATT-DEV, W-TOX-DEV) by $p$, and slide $t_a$ from 0.0 to 1.0 (Fig. 3). For each $t_a$ value, we set $t_r$ to the value that leaves a 1 − coverage percentage of development comments in the gray zone ($t_a \le p \le t_r$). We then select the $t_a$ (and $t_r$) that maximizes the weighted harmonic mean $F_\beta(P_{reject}, P_{accept})$ on the development set:
$$F_\beta(P_{reject}, P_{accept}) = \frac{(1 + \beta^2) \cdot P_{reject} \cdot P_{accept}}{\beta^2 \cdot P_{reject} + P_{accept}}$$
where $P_{reject}$ is the rejection precision (correctly rejected comments divided by rejected comments) and $P_{accept}$ is the acceptance precision (correctly accepted divided by accepted). Intuitively, coverage sets the width of the gray zone, whereas $P_{reject}$ and $P_{accept}$ show how certain we can be that the red (reject) and green (accept) zones are free of misclassified comments. We set $\beta = 2$, emphasizing $P_{accept}$, because moderators are more worried about wrongly accepting abusive comments than wrongly rejecting non-abusive ones. 14 The selected $t_a$, $t_r$ (tuned on development data) are then used in experiments on test data. In fully automatic moderation, coverage = 100% and $t_a = t_r$; otherwise, threshold tuning is identical.
13 We implemented the CNN directly in TensorFlow.
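The threshold tuning procedure can be sketched as a search over the position of the gray zone in the sorted development comments. This is a simplified illustration with hypothetical function names: it assumes distinct probabilities, skips candidates where the accept or the reject zone would be empty, and computes the score once over the whole development set rather than macro-averaging over time-ordered batches.

```python
def f_beta(p_reject, p_accept, beta=2.0):
    """Weighted harmonic mean of the two precisions; beta > 1 emphasizes p_accept."""
    denom = beta ** 2 * p_reject + p_accept
    return (1 + beta ** 2) * p_reject * p_accept / denom if denom > 0 else 0.0

def tune_thresholds(probs, gold_reject, coverage, beta=2.0):
    """Return (t_a, t_r, score) maximizing F_beta on development data.

    probs: P(reject|c) per comment; gold_reject: True if the gold label is 'reject'.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    p = [probs[i] for i in order]
    y = [gold_reject[i] for i in order]
    n = len(p)
    gray = round((1 - coverage) * n)  # number of comments left to moderators
    best = None
    # Comments with sorted index in [start, start + gray) form the gray zone.
    for start in range(1, n - gray):
        accepted, rejected = y[:start], y[start + gray:]
        p_accept = accepted.count(False) / len(accepted)
        p_reject = rejected.count(True) / len(rejected)
        score = f_beta(p_reject, p_accept, beta)
        if best is None or score > best[2]:
            if gray > 0:
                t_a, t_r = p[start], p[start + gray - 1]
            else:  # fully automatic: a single threshold between two comments
                t_a = t_r = (p[start - 1] + p[start]) / 2
            best = (t_a, t_r, score)
    return best

# Invented development data for illustration.
probs = [0.05, 0.10, 0.20, 0.30, 0.45, 0.55, 0.60, 0.80, 0.90, 0.95]
gold = [False, False, False, False, True, False, True, True, True, True]

semi = tune_thresholds(probs, gold, coverage=0.8)
full = tune_thresholds(probs, gold, coverage=1.0)
print(semi)  # the two hardest comments (0.45, 0.55) end up in the gray zone
print(full)
```

In this toy example, leaving 20% of the comments to moderators yields error-free accept and reject zones, whereas at 100% coverage the selected threshold keeps the accept zone clean at the cost of one wrongly rejected comment, reflecting the emphasis on the acceptance precision.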

Experimental results
Following Wulczyn et al. (2017), we report in Tables 2-3 AUC scores (area under ROC curve), along with Spearman correlations between system-generated probabilities $P(\text{accept}|c)$ and human probabilistic gold labels (Section 2.2) when probabilistic gold labels are available. 15 A first observation is that increasing the size of the Gazzetta training set (G-TRAIN-S to G-TRAIN-L, Table 2) significantly improves the performance of all methods; we do not report DETOX results for G-TRAIN-L, because its implementation could not handle the size of G-TRAIN-L. Tables 2-3 also show that RNN is always better than CNN and DETOX; there is no clear winner between CNN and DETOX. Furthermore, a-RNN is always better than RNN on Gazzetta comments (Table 2), but not always on Wikipedia comments (Table 3). Another observation is that da-RNN is always worse than a-RNN (Tables 2-3), confirming that the hidden states of the RNN are a better input to the attention mechanism than word embeddings. The performance of da-RNN deteriorates further when equal attention is assigned to the hidden states (eq-RNN), when the weighted sum of hidden states ($h_{sum}$) is replaced by the weighted sum of word embeddings (da-CENT), or both (eq-CENT). Also, da-CENT outperforms eq-CENT, indicating that the attention mechanism improves on simply averaging word embeddings. The Wikipedia subsets are easier (all methods perform better on Wikipedia subsets, compared to Gazzetta).
14 More precisely, when computing $F_\beta$, we reorder the development comments by time posted, and split them into batches of 100. For each $t_a$ (and $t_r$) value, we compute $F_\beta$ per batch and macro-average across batches. The resulting thresholds lead to $F_\beta$ scores that are more stable over time.
15 When computing AUC, the gold label is the majority label of the annotators. When computing Spearman, the gold label is probabilistic (% of annotators that accepted the comment). The decisions of the systems are always probabilistic.
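For readers unfamiliar with the two measures, the sketch below computes AUC and the Spearman correlation in plain Python. It is an illustration with hypothetical function names and invented numbers, not the evaluation code used for Tables 2-3: AUC is computed as the probability that a randomly chosen positive comment is scored above a randomly chosen negative one (ties count one half), and Spearman is the Pearson correlation of average ranks.

```python
import math

def auc(scores, positives):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, positives) if y]
    neg = [s for s, y in zip(scores, positives) if not y]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Invented toy values: system P(accept|c), majority gold labels, and the
# fraction of annotators that accepted each comment.
p_accept = [0.9, 0.4, 0.3, 0.6]
majority_accept = [True, True, False, False]
annotator_fraction = [1.0, 0.6, 0.2, 0.4]

print(auc(p_accept, majority_accept))                     # 0.75
print(round(spearman(p_accept, annotator_fraction), 3))   # 0.8
```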
Figure 4 shows $F_2(P_{reject}, P_{accept})$ on G-TEST-L, G-TEST-S, W-ATT-TEST, W-TOX-TEST, when $t_a$, $t_r$ are tuned on the corresponding development sets for varying coverage. For the Gazzetta datasets, we show results training on G-TRAIN-S (solid lines) and G-TRAIN-L (dashed). The differences between RNN and a-RNN are again small, but it is now easier to see that a-RNN is overall better. Again, a-RNN and RNN are better than CNN and DETOX, and the results improve with a larger training set (dashed). On W-ATT-TEST and W-TOX-TEST, a-RNN obtains $P_{accept}, P_{reject} \ge 0.94$ for all coverages (Fig. 4, call-outs). On the more difficult Gazzetta datasets, a-RNN still obtains $P_{accept}, P_{reject} \ge 0.85$ when tuned for 50% coverage. When tuned for 100% coverage, comments for which the system is uncertain (gray zone) cannot be avoided and there are inevitably more misclassifications; the use of $F_2$ during threshold tuning places more emphasis on avoiding wrongly accepted comments, leading to high $P_{accept}$ ($\ge 0.82$), at the expense of wrongly rejected comments, i.e., sacrificing $P_{reject}$ ($\ge 0.56$). On the re-moderated G-TEST-S-R (similar diagrams, not shown), $P_{accept}$, $P_{reject}$ become 0.96, 0.88 for coverage 50%, and 0.92, 0.48 for coverage 100%.

Related work
Napoles et al. (2017b) developed an annotation scheme for online conversations, with 6 dimensions for comments (e.g., sentiment, tone, off-topic) and 3 dimensions for threads. The scheme was used to label a dataset, called YNACC, of 9.2K comments (2.4K threads) from Yahoo News and 16.6K comments (1K threads) from the Internet Argument Corpus (Walker et al., 2012; Abbott et al., 2016). Abusive comments were filtered out, hence YNACC cannot be used for our purposes, but it may be possible to extend the annotation scheme for abusive comments, to predict more fine-grained labels, instead of 'accept' or 'reject'. [...] to assess the quality of a thread without processing the texts of its comments.
Diakopoulos (2015) discusses how editors select high quality comments. In further work, Napoles et al. (2017a) aimed to identify high quality threads. Their best method converts each comment to a comment embedding using DOC2VEC (Le and Mikolov, 2014). An ensemble of Conditional Random Fields (CRFs) (Lafferty et al., 2001) assigns labels (from their annotation scheme, e.g., for sentiment, off-topic) to the comments of each thread, viewing each thread as a sequence of DOC2VEC embeddings. The decisions of the CRFs are then used to convert each thread to a feature vector (total count and mean marginal probability of each label in the thread), which is passed on to an LR classifier. Further improvements were observed when additional features were added, BOW counts and POS n-grams being the most important ones. Napoles et al. (2017a) also experimented with a CNN, similar to that of Section 3.3, which was not, however, a top performer, presumably because of the small size of the training set (2.1K YNACC threads).
Djuric et al. (2015) experimented with 952K manually moderated comments from Yahoo Finance, but their dataset is not publicly available. They convert each comment to a DOC2VEC embedding, which is fed to an LR classifier. Nobata et al. (2016) experimented with approx. 3.3M manually moderated comments from Yahoo Finance and News; their data are also not available. 16 They used Vowpal Wabbit 17 with character n-grams (n = 3, ..., 5) and word n-grams (n = 1, 2), hand-crafted features (e.g., comment length, number of capitalized or black-listed words), features based on dependency trees, averages of WORD2VEC embeddings, and DOC2VEC-like embeddings. Character n-grams were the best, on their own outperforming Djuric et al. (2015). The best results, however, were obtained using all features. By contrast, we use no hand-crafted features and parsers, making our methods easily portable to other domains and languages.
Wulczyn et al. (2017) experimented with character and word n-grams, based on the findings of Nobata et al. (2016). We included their dataset and moderation system (DETOX) in our experiments. Wulczyn et al. also used DETOX (trained on W-ATT-TRAIN) as a proxy (instead of human annotators) to automatically classify 63M Wikipedia comments, which were then used to study the problem of personal attacks (e.g., the effect of allowing anonymous comments, how often personal attacks were followed by moderation actions). Our methods could replace DETOX in studies of this kind, since they perform better.
Waseem et al. (2016) used approx. 17K tweets annotated for hate speech. Their best method was an LR classifier with character n-grams (n = 1, ..., 4) and a gender feature. Badjatiya et al. (2017) experimented with the same dataset using LR, SVMs (Cortes and Vapnik, 1995), Random Forests (Ho, 1995), Gradient Boosted Decision Trees (GBDT) (Friedman, 2002), CNN (similar to that of Section 3.3), LSTM (Greff et al., 2015), and FastText (Joulin et al., 2017). They also considered alternative feature sets: character n-grams, tf-idf vectors, word embeddings, averaged word embeddings. Their best results were obtained using GBDT with averaged word embeddings learned by the LSTM, starting from random embeddings.
Warner and Hirschberg (2012) aimed to detect anti-semitic speech, experimenting with 9K paragraphs and a linear SVM. Their features consider windows of up to 5 tokens, the tokens of each window, their order, POS tags, Brown clusters etc., following Yarowsky (1994). Cheng et al. (2015) predict which users would be banned from on-line communities. Their best system uses a Random Forest or LR classifier, with features examining readability, activity (e.g., number of posts daily), community and moderator reactions (e.g., up-votes, number of deleted posts).
Lukin and Walker (2013) experimented with 5.5K utterances from the Internet Argument Corpus (Walker et al., 2012; Abbott et al., 2016) annotated with nastiness scores, and 9.9K utterances from the same corpus annotated for sarcasm. 18 In a bootstrapping manner, they manually identified cue words and phrases (indicative of nastiness or sarcasm), used the cue words to obtain training comments, and extracted patterns from the training comments. Xiang et al. (2012) also employed bootstrapping to identify users whose tweets frequently or never contain profane words, and collected 381M tweets from the two user types. They trained decision tree, Random Forest, or LR classifiers to distinguish between tweets from the two user types, testing on 4K tweets manually labeled as containing profanity or not. The classifiers used topical features, obtained via LDA (Blei et al., 2003), and a feature indicating the presence of at least one of approx. 330 known profane words.
Sood et al. (2012a; 2012b) experimented with 6.5K comments from Yahoo Buzz, moderated via crowdsourcing. They showed that a linear SVM, representing each comment as a bag of word bigrams and stems, performs better than word lists. Their best results were obtained by combining the SVM with a word list and edit distance. Yin et al. (2009) used posts from chat rooms and discussion fora (fewer than 15K posts in total) to train an SVM to detect online harassment. They used TF-IDF, sentiment, and context features (e.g., similarity to other posts in a thread). 19 Our methods might also benefit by considering threads, rather than individual comments. Yin et al. point out that unlike other abusive content, spam in comments or discussion fora (Mishne et al., 2005; Niu et al., 2007) is off-topic and serves a commercial purpose. Spam is unlikely in Wikipedia discussions and extremely rare so far in Gazzetta comments.
Mihaylov and Nakov (2016) identify comments posted by opinion manipulation trolls. Dinakar et al. (2011) and Dadvar et al. (2013) detect cyberbullying. Chandrinos et al. (2000) detect pornographic web pages, using a Naive Bayes classifier with text and image features. Spertus (1997) flags flame messages in Web feedback forms, using decision trees and hand-crafted features. A Kaggle dataset for insult detection is also available. 20 It contains 6.6K comments (3,947 train, 2,647 test) labeled as insults or not. However, abusive comments that do not directly insult other participants of the same discussion are not classified as insults, even if they contain profanity, hate speech, insults to third persons etc.

Conclusions
We experimented with a new publicly available dataset of 1.6M moderated user comments from a Greek sports news portal and two existing datasets of English Wikipedia talk page comments. We showed that a GRU RNN operating on word embeddings outperforms the previous state of the art, which used an LR or MLP classifier with character or word n-gram features. It also outperforms a vanilla CNN operating on word embeddings, and a baseline that uses an automatically constructed word list with precision scores. A novel, deep, classification-specific attention mechanism improves further the overall results of the RNN. The attention mechanism also improves the results of a simpler method that averages word embeddings. We considered both fully automatic and semi-automatic moderation, along with threshold tuning and evaluation measures for both.
We plan to consider user-specific information (e.g., ratio of comments rejected in the past) and thread statistics (e.g., thread depth, number of revisiting users) (Dadvar et al., 2013; Lee et al., 2014; Cheng et al., 2015; Waseem and Hovy, 2016). We also plan to explore character-level RNNs or CNNs, for example to produce embeddings of unknown or obfuscated words from characters (dos Santos and Zadrozny, 2014; Ling et al., 2015). We are also exploring how the attention scores of a-RNN can be used to highlight 'suspicious' words or phrases when showing gray-zone comments to moderators.