Using Convolutional Neural Networks to Classify Hate-Speech

The paper introduces a deep learning-based Twitter hate-speech text classification system. The classifier assigns each tweet to one of four predefined categories: racism, sexism, both (racism and sexism) and non-hate-speech. Four Convolutional Neural Network models were trained on, respectively, character 4-grams, word vectors based on semantic information built using word2vec, randomly generated word vectors, and word vectors combined with character n-grams. The feature set was down-sized in the networks by max-pooling, and a softmax function was used to classify tweets. Tested by 10-fold cross-validation, the model based on word2vec embeddings performed best, with higher precision than recall, and a 78.3% F-score.


Introduction
During the spring of 2017, parliamentary committees in Germany and the UK strongly criticised leading social media sites such as Facebook, Twitter and YouTube (Google) for failing to take sufficient and quick enough action against hate-speech, with the German government threatening to fine the social networks up to 50 million euros per year if they continue to fail to act on hateful postings (and posters) within a week (Thomasson, 2017).
When called to witness in front of the UK Home Affairs Committee, all the social media companies refused to reveal both the number of people they employ to battle hate-speech and the amount they spend on this. However, Google claimed to have invested "hundreds of millions" while Facebook stated that they had thousands of people working on the problem. The German government estimated that the companies combined already spend some 50 million euros per year and that the suggested new German law would increase that amount by 50% (CDU/CSU & SPD, 2017, p.14).
Regardless of the resources actually devoted by the social media networks, it is clear that their current efforts are not enough: "we are disappointed at the pace of development of technological solutions" (Home Affairs Committee, 2017, p.24). The UK and German governments also indicate that they are moving in the direction of treating online content providers in analogy with publishers of printed material, with the same obligations to abide by publishing laws.
With legislation in other countries set to follow (Nielsen, 2017), properly identifying hate-speech is a pressing issue, not only for the major players, but also for smaller companies, clubs, and organisations that allow for user-generated content on their sites (albeit the current German law proposal makes an exception for sites with fewer than 2 million users). Many such sites currently use slow, manual moderation, which means that abusive posts will be left online for too long without appropriate action being taken or that content will be published with delay (which might be unacceptable to the users, e.g., in online chat rooms).
Following the work by Collobert et al. (2011), deep neural networks have been shown to effectively solve several language processing tasks such as part-of-speech tagging, sentiment analysis, and named entity recognition. Here a Convolutional Neural Network (CNN) model with various features is utilised for hate-speech categorisation. Word vectors based on semantic information are built for all tokens using an unsupervised learning algorithm, word2vec. The word vectors are merged with a set of extracted features, downsized using max-pooling, and together with character n-grams (4-grams) fed to the neural network model to predict the categories of each tweet. The paper is organised as follows: Previous work on hate-speech identification is discussed in Section 2. Section 3 describes the deep learning-based hate-speech categorisation strategy, while experiments and results are reported in Section 4. Finally, Section 5 summarises the discussion.

Related Work
Although the above-noted law-maker interest in the issue is fairly recent, the task of identifying hate speech and abusive language in online content has already been topical in the research community for 20 years. Spertus (1997) built the decision tree-based classifier 'Smokey' which utilised 47 syntactic and semantic sentential features. When trained on a small set of 720 web page posts manually annotated (as "flame", "okay" or "maybe") and evaluated on 502 other messages, 'Smokey' performed well on classifying the non-inflammatory messages, but fell completely short on flame texts (thus obtaining an accuracy of only 88.2% on a task with a majority-class baseline of 86.1%).
Addressing the dataset size problem, Sood et al. (2012) collected 1.6 million comments from a Yahoo! social news site, of which 6,500 were randomly selected for annotation by 221 persons on Amazon Mechanical Turk (AMT). Several Support Vector Machine classifiers were trained on varying-size parts of this dataset using mainly word n-gram features, indicating that classification performance kept improving with increased datasets, but not as rapidly after the data size had passed 1,500 items. Looking at another set of AMT-annotated Yahoo! news posts, Nobata et al. (2016) experimented with several different word-internal, n-gram-based, syntactic, and distributional semantic features, concluding that character n-grams alone contribute sufficiently strongly for an online gradient descent learner to perform well on this type of data.
Moving away from features based solely on the language used in online messages, Chen et al. (2012) proposed a model also taking into account the posting patterns of the users in order to single out persons exhibiting abusive behaviour. Similarly, Buckels et al. (2014) aimed to extract traits from online user behaviour that would indicate antisocial personality. This is of particular importance for swift moderation of online chat rooms, as addressed by, e.g., Yin et al. (2009) and Papegnies et al. (2017), with the latter suggesting several types of features (at the morphological, syntactic and user behaviour levels) that can be used for identifying when gamers on a French MMO (massively multiplayer online) game site move from discussing game-related issues to posting personal inflammatory remarks.
Of particular relevance to the present work are previous efforts on identifying abusive language on Twitter. Xiang et al. (2012) created offensive-language topic clusters using Logistic Regression over a set of 860,071 tweets automatically annotated using a boot-strapping technique and supplemented with a dictionary of 339 offensive words. When tested on 4,029 randomly selected tweets collected just after the training set, the lexicon-enhanced clustering outperformed a keyword matching baseline. Logistic Regression and a dictionary were also utilised by Davidson et al. (2017); however, they used crowd-sourcing to create their hate-speech dictionary and aimed to separate the tweets into three classes: hate-speech, offensive language, and neither. Working on a set of 24,802 manually labelled tweets, they achieved good recall and precision overall, but noted that almost 40% of the actual hate-speech tweets were misclassified, although with 3/4 of those being mistaken for offensive language only.
A recurring problem with several of these experiments has been that the annotated datasets have not always been made publicly available. However, Ross et al. (2016) had a set of 541 German tweets annotated, in particular addressing the issues of annotator and annotation reliability, and what information should be provided to the annotators. Waseem (2016) discusses similar issues while providing a set of 6,909 English tweets annotated for hate-speech by CrowdFlower users and extending a previous such dataset (Waseem and Hovy, 2016). This dataset will be used in the experiments reported below. Wulczyn et al. (2016) also used CrowdFlower to obtain human annotations of 115,737 comments on Wikipedia as to whether they contained personal attacks and harassment. They furthermore experimented with strategies to automatically expand the dataset, comparing Multi-Layer Perceptrons (a single-hidden-layer neural network) to Logistic Regression, and word n-grams to character n-grams, concluding that Logistic Regression with character n-grams performed best.

CNN-based Hate-Speech Classification
This section describes the hate-speech identification system architecture based on Convolutional Neural Networks (CNN). An overview of the system is shown in Figure 1. The first step of the system is to generate feature embeddings. Feature embeddings for all words were constructed by using word embeddings and character n-grams.
The word embeddings were generated in two ways, through word2vec (Mikolov et al., 2013a,b) and through random vectors. In the random vector setting, all the words in the corpora are initialised with random values. In the word2vec version, word vectors are generated based on the context. There are two types of such embeddings: continuous bag-of-words (CBOW) and skip-gram models. In the CBOW architecture, the model predicts the current word from a window of surrounding context words. In the skip-gram model, the context words are predicted using the current word.
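The difference between the two word2vec architectures can be illustrated by the training examples each one is fed. The following is a minimal sketch only, not the word2vec implementation itself: the function names, the toy sentence and the window size are assumptions made for illustration.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i] itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: each example maps the surrounding context words to the current word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """Skip-gram: each example maps the current word to one of its context words."""
    return [(tokens[i], context)
            for i in range(len(tokens))
            for context in context_window(tokens, i, window)]


# Toy input (hypothetical, not taken from the dataset).
tokens = ["this", "is", "a", "tweet"]
print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```

With a window of 1, CBOW yields one example per token (context list, target word), whereas skip-gram yields one example per (word, neighbour) pair.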
In addition to the word embeddings, length 28 one-hot character n-gram vectors were generated, with 26 elements for the English alphabet, one for digits, and one for all other characters/symbols. The feature embeddings were produced by concatenating the word embeddings with these character n-gram vectors.
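A minimal sketch of the length-28 one-hot character encoding is given below. The slot layout (letters at positions 0 to 25, digits at 26, everything else at 27) and the case folding are assumptions, as is the choice to keep one vector per character of a 4-gram; the text does not specify how the per-character vectors of an n-gram are combined.

```python
import string

VECTOR_LENGTH = 28   # 26 letters + 1 slot for digits + 1 slot for everything else
DIGIT_INDEX = 26     # assumed position of the shared digit slot
OTHER_INDEX = 27     # assumed position of the "other characters/symbols" slot


def char_index(ch):
    """Map a single character to its slot in the 28-element vector."""
    if ch in string.ascii_letters:
        return ord(ch.lower()) - ord("a")   # case-folded: 'a'/'A' -> 0, ..., 'z'/'Z' -> 25
    if ch in string.digits:
        return DIGIT_INDEX
    return OTHER_INDEX


def one_hot_char(ch):
    """One-hot vector of length 28 for a single character."""
    vector = [0] * VECTOR_LENGTH
    vector[char_index(ch)] = 1
    return vector


def char_ngrams(text, n=4):
    """All overlapping character n-grams of the text (4-grams by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def encode_ngram(ngram):
    """One 28-element one-hot vector per character of the n-gram."""
    return [one_hot_char(ch) for ch in ngram]


print(char_ngrams("tweet"))            # the two 4-grams of "tweet"
print(encode_ngram("a1#!")[1])         # one-hot vector with the digit slot set
```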
A pooling layer in the network converts each tweet into a fixed length vector, capturing the information from the entire tweet. A max-pooling layer then captures the most important latent semantic factors from the tweets.
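The way max-pooling turns a variable-length tweet into a fixed-length vector can be sketched as follows. This is an illustration under assumptions: the input is taken to be a list of per-position activations (one value per convolution filter), and the numbers are made up.

```python
def max_pool_over_time(feature_maps):
    """Keep, for each filter, its largest activation over all positions.

    feature_maps: list with one entry per position in the tweet, each entry
    being a list of activations (one per filter). The result has one value
    per filter, regardless of how many positions the tweet has.
    """
    if not feature_maps:
        raise ValueError("cannot pool an empty sequence")
    return [max(column) for column in zip(*feature_maps)]


# Two hypothetical tweets of different lengths, three filters each.
short_tweet = [[0.1, 0.9, 0.0],
               [0.4, 0.2, 0.3]]
long_tweet = [[0.5, 0.1, 0.2],
              [0.0, 0.7, 0.1],
              [0.3, 0.3, 0.8],
              [0.2, 0.0, 0.4]]

print(max_pool_over_time(short_tweet))
print(max_pool_over_time(long_tweet))
```

Both outputs have length three (one value per filter), so the subsequent layers see inputs of the same size for every tweet.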
On the output side, a softmax layer calculates the class probability distributions for each tweet and assigns the hate-speech classes / labels based on the probability values.
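The softmax step can be sketched as below. The label order and the raw scores are assumptions chosen for illustration; only the four category names come from the text.

```python
import math

# Assumed label order; the four categories themselves are those used in the paper.
LABELS = ["racism", "sexism", "both", "non-hate-speech"]


def softmax(scores):
    """Turn raw output scores into a probability distribution."""
    largest = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(scores, labels=LABELS):
    """Return the most probable label together with the full distribution."""
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best], probabilities


# Hypothetical raw scores for one tweet.
label, probabilities = predict_label([0.2, 2.0, -1.0, 1.5])
print(label, [round(p, 3) for p in probabilities])
```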

Experiments
Four approaches to hate-speech classification were tested, based on different feature embeddings. All models were applied to the English Twitter hate-speech dataset created by Waseem (2016). Each tweet in the dataset has been annotated by one Expert annotator and three Amateur annotators, with four labels: non-hate-speech (84% of the data), racism, sexism, and both (i.e., racism and sexism).
Waseem (2016) defined the "Expert" annotators as those having both a theoretical and applied knowledge of hate speech (those were recruited among feminist and antiracism activists), while the "Amateur" annotations were obtained by crowd-sourcing (on the CrowdFlower platform). We combined the annotated tags for each tweet based on majority voting, where the Expert was given double unit votes and each of the Amateurs was given a single unit vote.
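The weighted majority voting can be sketched as follows. The text does not say how ties are resolved (for example, when the Expert's two votes are matched by two Amateurs agreeing on another label); resolving them in favour of the Expert is an assumption of this sketch, as are the function name and the short label strings.

```python
from collections import Counter


def combine_annotations(expert_label, amateur_labels, expert_weight=2):
    """Weighted majority vote: the Expert counts double, each Amateur once."""
    votes = Counter()
    votes[expert_label] += expert_weight
    for label in amateur_labels:
        votes[label] += 1

    top_score = max(votes.values())
    winners = [label for label, score in votes.items() if score == top_score]

    # ASSUMPTION: ties go to the Expert's label; the source does not specify this.
    if expert_label in winners:
        return expert_label
    # Otherwise pick deterministically among the tied labels.
    return sorted(winners)[0]


print(combine_annotations("sexism", ["none", "none", "none"]))    # Amateurs outvote the Expert 3 to 2
print(combine_annotations("racism", ["none", "none", "sexism"]))  # 2 to 2 tie, resolved by assumption
```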
The class distributions of the dataset are shown in Table 1. The total size of the dataset (6,655 tweets) is slightly lower than the original set (Waseem reported it as containing 6,909 tweets), since some of the annotated tweets were unavailable or had been deleted.

[Table 1: Class distribution of the dataset. Columns: Data; Number of tweets.]

Results
The average 10-fold cross-validated results for all four Convolutional Neural Network (CNN) models are shown in Table 2, and compared to the Logistic Regression (LogReg) model used by Waseem and Hovy (2016).
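The 10-fold cross-validation protocol can be sketched as below. Whether the folds were stratified by class is not stated; the sketch uses a plain shuffled split, and the seed and function name are assumptions.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds.

    ASSUMPTION: a plain shuffled split without stratification by class.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]      # k nearly equal-sized folds
    for held_out in range(k):
        test = folds[held_out]
        train = [idx
                 for fold_number, fold in enumerate(folds)
                 if fold_number != held_out
                 for idx in fold]
        yield train, test


# 6,655 is the dataset size reported in the text.
for fold_number, (train, test) in enumerate(k_fold_indices(6655), start=1):
    print(fold_number, len(train), len(test))
```

Each tweet is held out exactly once, and the reported scores are the averages over the ten held-out folds.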
In the first CNN model, random word vectors were considered as feature embeddings when training the network.
This baseline model achieved precision, recall and F-score values of 86.68%, 67.26% and 75.63%, respectively, marking a drastic improvement in precision compared to the LogReg model, but at the expense of lower recall. In the second approach, word2vec word vectors were taken as feature embeddings to train the CNN model, resulting in clearly improved recall (a relative gain of 7.3%), for an F-score of 78.29%, even though the precision actually was slightly reduced compared to using the random vectors.
The third and fourth models both added character n-grams to the input of the CNN model. In line with the experiments reported on the same dataset by Waseem and Hovy (2016), length 4 character n-grams were used. In the third model, only the character n-grams were considered as feature embeddings when training the CNN model, while in the fourth model, the feature embeddings were generated by concatenating word2vec word embeddings and character n-grams. Tested by 10-fold cross-validation, the latter system showed better precision (86.61%) than recall (70.42%), for an F-score of 77.38%.
However, although the character n-grams thus helped a little in improving precision, the word2vec model without character n-grams still achieved the best results of all the compared models, with the precision, recall and F-score values of 85.66%, 72.14% and 78.29%, respectively. Note that all CNN models convincingly outperformed Logistic Regression in terms of both precision and F1-score, while the LogReg model achieved better recall than all the neural network models.
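The relation between the three reported measures can be sketched as follows, with the F-score taken as the harmonic mean of precision and recall. The function names are assumptions. Applying the formula to the averaged precision and recall gives values within a few tenths of a point of the reported F-scores; an exact match is not expected if, as seems likely but is not stated, the F-scores were averaged per fold.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision and recall from raw counts (0.0 when undefined)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


def f_score(precision, recall):
    """F1: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Averaged precision and recall of the word2vec model, as reported in the text.
print(round(f_score(85.66, 72.14), 2))
```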

Error Analysis
An error analysis was carried out for each of the 10 folds. The confusion matrices are shown in Table 3. It can be observed that the model overall did not identify many tweets as hate-speech tweets. This may be due to insufficient training instances. Furthermore, the system wrongly identified some non-hate-speech tweets as hate-speech.
In particular, the system was not able to properly identify the category 'both', since the examples of this category are very few (1 or 2 per fold) with respect to the whole set of training instances. The system performed better in the 'sexism' category than in the other hate-speech categories ('both' and 'racism') because the number of tweets in this category is larger.
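A per-fold confusion matrix of the kind analysed here can be computed as in the sketch below. The gold and predicted labels are made-up toy data, not results from the paper, and the label order and function names are assumptions.

```python
from collections import Counter


def confusion_matrix(gold, predicted, labels):
    """Rows are gold labels, columns are predicted labels."""
    counts = Counter(zip(gold, predicted))
    return [[counts[(g, p)] for p in labels] for g in labels]


def per_class_recall(matrix, labels):
    """Share of each gold class that was predicted correctly."""
    recalls = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recalls[label] = matrix[i][i] / row_total if row_total else 0.0
    return recalls


# Toy data for illustration only.
labels = ["racism", "sexism", "both", "none"]
gold = ["none", "none", "sexism", "sexism", "racism", "both"]
predicted = ["none", "sexism", "sexism", "none", "none", "none"]

matrix = confusion_matrix(gold, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print(per_class_recall(matrix, labels))
```

In the toy data, the rare classes are absorbed by the majority class, which mirrors the pattern described above for the 'both' category.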

Conclusion and Future Work
Here we have experimented with a system for Twitter hate-speech text classification based on a deep-learning, Convolutional Neural Network model. The classifier assigns each tweet to one of four predefined categories: racism, sexism, both (racism and sexism) and neither.
Two CNN models were created based on different input vector sets that were fed to the neural networks for training and classification. Word vectors based on semantic information were built using an unsupervised strategy, word2vec, and compared to a randomly generated vector baseline. In addition, two CNN models were trained on character 4-grams, as well as on a combination of word vectors and character n-grams. The feature set is down-sized in the networks by a max-pooling layer, while a softmax layer is utilised to assign the tweets their most probable label category.
Trained and tested by 10-fold cross-validation, the system based on word2vec word vectors performed best overall, with an F1-score of 78.3%. Adding character n-grams slightly increased the precision, but resulted in lower recall and F-score.
The tested models and neural network architectures could be extended in several ways: The word2vec embeddings used here were built on skip-grams that predict the context words using the current word. An alternative would be to use continuous bag-of-words models that basically do the opposite and predict the current word from a window of surrounding context words. Also, following Waseem and Hovy (2016) only length 4 character n-grams were used. Clearly it would be interesting to explore whether these are uniformly ineffective when changing the n-gram size.
The experiments reported here were carried out on a convolutional network architecture, but other types of deep neural networks could obviously be tried. In particular, the bi-directional Long Short-Term Memory (LSTM) recurrent neural network architecture has shown itself to be useful for language processing problems where utilising the sequential nature of the input is more essential, such as named entity recognition and sentiment analysis, although most of the best performing systems in SemEval 2016 (the International Workshop on Semantic Evaluation; Task 4: Sentiment Analysis in Twitter) actually utilised convolutional neural networks or combinations of CNNs and other approaches (Nakov et al., 2016).
Along those lines, Sikdar and Gambäck (2017) report experiments with a set-up for named entity recognition combining an LSTM with a more traditional machine learning classifier based on Conditional Random Fields (CRF). Such an approach could also be tested for the abusive language classification task, either using the LSTM/CRF combination or including CNN.