INF-HatEval at SemEval-2019 Task 5: Convolutional Neural Networks for Hate Speech Detection Against Women and Immigrants on Twitter

In this paper, we describe our approach to detecting hate speech against women and immigrants on Twitter in a multilingual context, English and Spanish. This challenge was proposed as SemEval-2019 Task 5, in which participants develop models for hate speech detection: a two-class classification where systems have to predict whether a tweet in English or in Spanish with a given target (women or immigrants) is hateful or not hateful (Task A), and whether the hate speech is directed at a specific person or a group of individuals (Task B). For this, we implemented a Convolutional Neural Network (CNN) using pre-trained word embeddings (GloVe and FastText) with 300 dimensions. In Task A, our proposed model obtained an F1-score of 0.488 for English and 0.696 for Spanish. In Task B, the CNN obtained an EMR of 0.297 for English and 0.430 for Spanish.


Introduction
With the growth in the number of social network users, there has also been an increase in the hateful activities that permeate these communication structures. According to Nockleby et al. (2000), hate speech can be defined as any communication that disparages a person or a group on the basis of characteristics such as race, color, ethnicity, gender, nationality, religion or other features. The main factor that encourages users to spread hate on social networks is anonymity, which lets them direct hateful words at a particular target. The hatred propagated in this way can have irreversible consequences: young people who are exposed to cyberbullying and homophobia, in particular, may commit suicide.
Nowadays, social networks like Twitter (https://twitter.com/), Facebook and YouTube are under pressure to develop tools to fight the proliferation of hate on their platforms. A good example is the German government, which threatened to fine social networks up to 50 million euros if they did not fight the spread of hate (Gambäck and Sikdar, 2017). However, although there is plenty of content available on social networks, detecting hate speech remains difficult, largely because different studies use different data sets and because benchmarks and efficient approaches are lacking. Waseem, for example, presents a study focused on the detection of racism and sexism, whereas Nahar et al. (2012) and Sanchez and Kumar (2011) conducted a survey on detecting bullying. For the detection of homophobia, misogyny and xenophobia, the number of papers is still limited; one recent example is Sanguinetti et al. (2018), where the authors sought to identify hate speech against immigrants. It is therefore important that new research is published, because only in this way will it be possible to fight hate on social networks.
Starting from a brief definition of hate speech and the importance of combating it, SemEval-2019 proposed a task that challenges participants to develop systems for detecting hate speech against women and immigrants on Twitter from a multilingual perspective, for English and Spanish.
The task was organised around two related subtasks for each of the languages involved. The first is a basic hate speech detection task. The second examines finer-grained properties of hateful content, to understand how existing approaches handle the identification of especially dangerous forms of hatred: those in which incitement is directed against an individual rather than against a group of people, and those in which aggressive behavior of the perpetrator is a prominent feature of the expression of hatred. To address this task, this work developed a Convolutional Neural Network that uses word embeddings.
The paper is organised as follows: previous work on hate-speech identification is discussed in Section 2. Section 3 presents details about the task, data sets and evaluation methods. Section 4 describes the methodology for categorizing hate speech based on deep learning, while experiments and results are reported in Section 5. Finally, Section 6 summarises the discussion.

Related Work
Some computational methods to detect hate speech are presented in this section. One example is the work of Badjatiya et al. (2017), who applied several machine learning and deep learning algorithms, among them Logistic Regression, Support Vector Machine, Random Forest, Gradient Boosted Decision Trees (GBDT), CNN and Long Short-Term Memory (LSTM). As baselines they used char n-grams and bag-of-words, and as word embeddings they used GloVe and FastText. The objective was to classify a tweet as racist, sexist or neither, and the best result was an F1-score of 0.930, obtained with an LSTM model with random embeddings and GBDT.
Another work that also follows a ternary classification scheme was proposed by Malmasi and Zampieri (2017), where the purpose is to classify a tweet as hateful, offensive (but not hateful) or not offensive. For this, the researchers proposed an approach based on n-grams and word skip-grams using a Support Vector Machine with cross-validation; the best result achieved was an accuracy of 0.78.
Gambäck and Sikdar (2017) developed a Convolutional Neural Network to classify hate speech on Twitter. In this case, the authors used 4 categories: racism, sexism, both (racism and sexism) and non-hate-speech. The CNN was built with convolutional and pooling layers and was trained on 4 kinds of input: character 4-grams, word vectors based on semantic information built using word2vec (Mikolov et al., 2013a), randomly generated word vectors, and word embeddings combined with character n-grams. In the classification phase, the softmax function and 10-fold cross-validation were applied; the model based on word2vec embeddings performed best, with an F-score of 0.783.
A recent study by Gaydhani et al. (2018) sought to address the difference between offensive language and hate speech. The authors proposed several machine learning models based on n-grams and TF-IDF, and analyzed them over several values of n and several TF-IDF normalization methods. The best result among these approaches was an accuracy of 0.956.

SemEval-2019 Task 5
In this section we will describe some details about data sets, tasks, and evaluation methods.

Task A
Task A is a two-class classification problem in which participants have to predict whether a tweet, in English or Spanish, with a particular target (women or immigrants) is hateful or not hateful (Hate Speech, 1/0).

Task B
The purpose of this task is to: (i) classify hateful tweets in English and Spanish, that is, tweets in which hate speech against women or immigrants was identified, as aggressive or non-aggressive, and (ii) identify whether the harassed target is a single person or a group of individuals.

Evaluation
For the results evaluation of both tasks A and B, different metrics were used in order to allow more refined conclusions.
Task A. The systems are evaluated according to the following metrics: accuracy, precision, recall and F1-score. The equations below show how they are calculated. For this task, systems are ranked by F1-score. The equations rely on the following definitions:
• True positive (TP): a correct classification as hateful. For example, the actual class is hateful and the model classifies the tweet as hateful.
• True negative (TN): a correct classification as not hateful. For example, the actual class is not hateful and the model classifies the tweet as not hateful.
• False positive (FP): a wrong classification as hateful. For example, the actual class is not hateful and the model classifies the tweet as hateful.
• False negative (FN): a wrong classification as not hateful. For example, the actual class is hateful and the model classifies the tweet as not hateful.
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{1} \]
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{2} \]
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{3} \]
\[ \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4} \]
Task B. In this task, there are two evaluation metrics: partial match and exact match. The partial match evaluates the Hate Speech (HS), Target Range (TR) and Aggressiveness (AG) classes independently of each other using the metrics defined above. Each system is reported with all of these measures and with a summary of its performance in terms of the macro-average F1-score, calculated according to Equation 5.
\[ \mathrm{F1}_{macro} = \frac{\mathrm{F1}(HS) + \mathrm{F1}(TR) + \mathrm{F1}(AG)}{3} \tag{5} \]
The exact match considers the predicted classes together, thus computing the Exact Match Ratio (Kazawa et al., 2005). Given a data set consisting of n multi-label samples (Xi, Yi), where Xi denotes the i-th instance and Yi corresponds to the labels to be predicted (HS, TR and AG), the Exact Match Ratio (EMR) is calculated according to Equation 6,
\[ \mathrm{EMR} = \frac{1}{n} \sum_{i=1}^{n} I(Y_i = Z_i) \tag{6} \]
where Zi denotes the set of labels predicted for the i-th instance and I is the indicator function.
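As an illustration of the evaluation measures described in this section, the following minimal sketch computes the Task A metrics from the TP, TN, FP and FN counts, and the Exact Match Ratio for Task B. It uses the standard definitions of these metrics and is not the official scoring script; the function names and the toy labels are our own assumptions, and the F1-score shown is the one for the hateful (positive) class.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(gold: Sequence[int], pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count TP, TN, FP, FN, treating label 1 (hateful) as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    return tp, tn, fp, fn


def binary_metrics(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive class."""
    tp, tn, fp, fn = confusion_counts(gold, pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def macro_f1(f1_scores: Sequence[float]) -> float:
    """Unweighted mean of per-class F1 scores (e.g. for HS, TR and AG)."""
    return sum(f1_scores) / len(f1_scores)


def exact_match_ratio(gold: Sequence[Tuple[int, int, int]],
                      pred: Sequence[Tuple[int, int, int]]) -> float:
    """Fraction of instances whose full (HS, TR, AG) label set is predicted exactly."""
    matches = sum(1 for y, z in zip(gold, pred) if tuple(y) == tuple(z))
    return matches / len(gold)


# Toy example (made-up labels, not task data).
gold_hs = [1, 0, 1, 1, 0, 0]
pred_hs = [1, 0, 0, 1, 1, 0]
scores = binary_metrics(gold_hs, pred_hs)

gold_b = [(1, 1, 0), (0, 0, 0), (1, 0, 1)]
pred_b = [(1, 1, 0), (0, 0, 0), (1, 1, 1)]
emr = exact_match_ratio(gold_b, pred_b)
```

Note that the EMR only rewards an instance when all three labels are correct, which is why it is usually much lower than the per-class scores.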

Methodology
In this section, we describe the details of our proposed methods, including data preprocessing, neural networks and word embeddings.

Preprocessing
This step removes noise and terms that carry no semantic significance for class prediction. We removed links, numbers, special characters and stop words (words with low discriminative power, for example, "is" and "that"), and converted the text to lowercase.
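The preprocessing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact pipeline used in our system: the regular expressions, the tiny stop-word set and the function name `preprocess` are hypothetical, and a real system would use a complete stop-word list for each language.

```python
import re
from typing import List

# Illustrative stop-word set only; a full English or Spanish list is assumed in practice.
STOP_WORDS = frozenset({"is", "that", "the", "a", "an", "and", "of", "to"})

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Keep lowercase letters (including Spanish accented letters) and whitespace;
# digits, punctuation, '#' and '@' are treated as special characters and dropped.
NON_LETTER_RE = re.compile(r"[^a-záéíóúüñ\s]")


def preprocess(tweet: str) -> List[str]:
    """Lowercase a tweet, strip links, numbers and special characters, drop stop words."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)          # remove links before stripping punctuation
    text = NON_LETTER_RE.sub(" ", text)   # remove numbers and special characters
    return [token for token in text.split() if token not in STOP_WORDS]


tokens = preprocess("This is a LINK https://t.co/abc123 and 42 #hashtags!!")
```

Links are removed before the other characters so that fragments of a URL do not survive as spurious tokens.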

Word embeddings
Word embeddings (Bengio et al., 2003) are obtained from a supervised statistical language model trained using deep neural networks. The purpose of this language model is to predict the next word, given the previous words of the sentence. Embedding vectors were a great advance over strategies based on bag-of-words, which justifies their use in several works (Nakov et al., 2016; Poria et al., 2015; Cliche, 2017; Zhou et al., 2018; Rotim et al., 2017). For the proposed task, we use the GloVe (Pennington et al., 2014) and FastText models with 300 dimensions.
For the English language, we made use of the Stanford pre-trained GloVe (Pennington et al., 2014), where the word embeddings were trained on Wikipedia 2014 and Gigaword 5, while FastText was trained on Wikipedia 2017, the UMBC webbase corpus and statmt.org news.
For the Spanish language, the GloVe vocabulary was computed from SBWC (Pennington et al., 2014; Cardellino, 2016), while FastText was computed from the Spanish Wikipedia.
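To make the use of pre-trained vectors concrete, the sketch below reads vectors in the plain-text format used by the GloVe and FastText distributions (one word per line followed by its values) and builds an embedding matrix for a vocabulary. The function names, the zero vector for out-of-vocabulary words and the reserved padding row are assumptions made for illustration; the toy vectors have 3 dimensions instead of 300.

```python
from typing import Dict, Iterable, List


def load_vectors(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse 'word v1 v2 ... vd' lines into a word -> vector dictionary."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 3:
            # Skips blank lines and the 'count dim' header of FastText .vec files.
            continue
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def build_embedding_matrix(vocab: List[str],
                           vectors: Dict[str, List[float]],
                           dim: int) -> List[List[float]]:
    """Row 0 is reserved for padding; row i + 1 holds the vector of vocab[i]."""
    matrix = [[0.0] * dim]
    for word in vocab:
        # Assumption: words without a pre-trained vector get a zero vector.
        matrix.append(list(vectors.get(word, [0.0] * dim)))
    return matrix


# Toy input standing in for a pre-trained vector file opened with open(path).
toy_file = ["2 3", "hate 0.1 0.2 0.3", "love 0.4 0.5 0.6"]
toy_vectors = load_vectors(toy_file)
toy_matrix = build_embedding_matrix(["hate", "unknown"], toy_vectors, dim=3)
```

In a full system, the resulting matrix would initialise the embedding layer of the network described in the next subsection.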

Convolutional Neural Networks
The Convolutional Neural Network architecture was initially designed for image processing; however, it has also been commonly used in sentiment analysis (Wang et al., 2016; Cambria et al., 2016; Rosenthal et al., 2017; dos Santos and Gatti, 2014; Poria et al., 2015).
For the purpose of the task, a CNN was implemented based on the architecture proposed by Zhang and Wallace (2015). This implementation can be divided into two steps: feature extraction and classification. In the feature extraction step, only two convolution layers and two pooling layers were used, with the tanh activation function. Four filters were used: two of size 3 and two of size 4. Each filter corresponds to the classic n-gram technique (widely used in bag-of-words based models), which consists of processing a group of n words, in order to consider not only isolated words in a tweet, but also the context in which they occur. The filters are applied over the vector representation of the input tweets (embedding layer), and their weights are adjusted dynamically by backpropagation. According to Poria et al. (2016), these filters can extract lexical, syntactic or semantic features automatically. Finally, the outputs of the two convolution and pooling layers are concatenated and passed to the next step.
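The feature extraction step can be illustrated with a small, framework-free sketch. It assumes that the convolution layers act as parallel branches over the embedded tweet and that pooling is max-over-time, in the spirit of Zhang and Wallace (2015); the text above does not fix these details, so they are assumptions, as are the function names and the toy 2-dimensional vectors.

```python
import math
from typing import List, Tuple

Vector = List[float]
Kernel = List[Vector]  # one weight vector per word position in the n-gram window


def conv1d_tanh(embedded: List[Vector], kernel: Kernel, bias: float) -> List[float]:
    """Slide a filter spanning len(kernel) consecutive words over the embedded tweet."""
    width = len(kernel)
    dim = len(kernel[0])
    if len(embedded) < width:
        # Assumption: short tweets are right-padded with zero vectors.
        embedded = embedded + [[0.0] * dim] * (width - len(embedded))
    feature_map = []
    for start in range(len(embedded) - width + 1):
        total = bias
        for offset in range(width):
            word_vec = embedded[start + offset]
            total += sum(w * x for w, x in zip(kernel[offset], word_vec))
        feature_map.append(math.tanh(total))
    return feature_map


def max_over_time(feature_map: List[float]) -> float:
    """Keep the strongest response of a filter, wherever it occurs in the tweet."""
    return max(feature_map)


def extract_features(embedded: List[Vector], filters: List[Tuple[Kernel, float]]) -> List[float]:
    """Apply every filter, pool each feature map, and concatenate the pooled values."""
    return [max_over_time(conv1d_tanh(embedded, kernel, bias)) for kernel, bias in filters]


# Toy tweet of 4 words with 2-dimensional embeddings.
tweet = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
trigram_filter = ([[1.0, 1.0]] * 3, 0.0)   # window of 3 words
fourgram_filter = ([[0.5, 0.5]] * 4, 0.0)  # window of 4 words
features = extract_features(tweet, [trigram_filter, fourgram_filter])
```

Each pooled value summarises how strongly one n-gram pattern is present in the tweet, and the concatenated list is the input to the classification step.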
In the classification stage, we used two dense layers. The first has 512 neurons, the ReLU activation function and a dropout of 0.5. The second has a single neuron with the sigmoid activation function, and is where the classification occurs. For training, the loss function was binary cross-entropy and the optimizer was RMSprop (Hinton et al., 2012) with a learning rate of 0.001.
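The training objective can be sketched for the output neuron alone. The example below implements the sigmoid activation, the binary cross-entropy loss and one RMSprop update with the learning rate of 0.001 stated above. The decay factor `rho = 0.9` and the small constant `eps` are assumptions (common defaults, not values reported here), and the 512-neuron hidden layer and dropout are omitted for brevity.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true: int, y_prob: float, eps: float = 1e-7) -> float:
    """Loss for one example; probabilities are clipped to avoid log(0)."""
    p = min(max(y_prob, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def rmsprop_step(weights: List[float], grads: List[float], cache: List[float],
                 lr: float = 0.001, rho: float = 0.9,
                 eps: float = 1e-7) -> Tuple[List[float], List[float]]:
    """One RMSprop update: scale each gradient by a running RMS of its history."""
    new_cache = [rho * c + (1.0 - rho) * g * g for c, g in zip(cache, grads)]
    new_weights = [w - lr * g / (math.sqrt(c) + eps)
                   for w, g, c in zip(weights, grads, new_cache)]
    return new_weights, new_cache


def train_output_neuron(x: List[float], y: int, steps: int) -> Tuple[List[float], float]:
    """Fit a single sigmoid neuron (no bias) to one example; returns weights and final loss."""
    weights = [0.0] * len(x)
    cache = [0.0] * len(x)
    for _ in range(steps):
        prob = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        # Gradient of binary cross-entropy through the sigmoid: (p - y) * x
        grads = [(prob - y) * xi for xi in x]
        weights, cache = rmsprop_step(weights, grads, cache)
    final_prob = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
    return weights, binary_crossentropy(y, final_prob)


initial_loss = binary_crossentropy(1, sigmoid(0.0))
trained_weights, final_loss = train_output_neuron([1.0, -1.0], y=1, steps=100)
```

Because RMSprop divides each gradient by a running estimate of its magnitude, the step size stays close to the learning rate regardless of the raw gradient scale.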

Results
In this section we discuss the results obtained by using a CNN for the detection of hate speech and of the target of hate. We obtained an F1-score of 0.488 for English and 0.696 for Spanish with our CNN model using word embeddings, as shown in Table 2. This result also suggests that the combination of CNN and GloVe provides better results for this task. Table 3 displays the F1-score, precision and recall reached for each class in each language. The F1-score can be used to measure the performance of the classifier; in the English language, the CNN classified the hateful class better, obtaining an F1-score of 0.617, while the result for the not hateful class was an F1-score of 0.359.
For the Spanish language, the CNN obtained good results in the classification of both classes, hateful and not hateful, with F1-scores of 0.685 and 0.707, respectively. Recall that the goal of Task B is to identify the target of the hate speech, that is, whether it is a single person or a group of individuals. Given that there is hate speech in the tweet (HS is 1), one must detect whether the target is a single person (TR is 1) or a group of individuals (TR is 0), and whether aggressiveness is present in the speech (AG is 1) or not (AG is 0). In this case, the EMR corresponds to an accuracy rate over complete label sets: it measures how often the model correctly predicted not only the presence of hate speech, but also the target and the aggressiveness. Table 4 shows the results of Task B, where we obtained an EMR of 0.297 for English and 0.430 for Spanish.
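As a quick consistency check on the Task A numbers reported above, the short sketch below averages the per-class F1-scores for each language. It only re-uses figures already given in this section; reading the overall score as the unweighted (macro) mean of the two classes is our interpretation of these numbers, and the variable names are illustrative.

```python
# Per-class F1-scores for Task A as reported in this section: (hateful, not hateful).
per_class_f1 = {
    "English": (0.617, 0.359),
    "Spanish": (0.685, 0.707),
}

# Unweighted mean over the two classes, rounded to three decimals.
macro_f1_by_language = {
    language: round(sum(scores) / len(scores), 3)
    for language, scores in per_class_f1.items()
}
```

The averages match the overall F1-scores of 0.488 for English and 0.696 for Spanish, and they show that the low English score is driven by the not hateful class.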

Conclusion
In this paper, we introduced the system that we proposed for SemEval-2019 Task 5. Our goal was to experiment with an architecture adapted from a CNN using word embeddings. The task was to detect hate speech against women and immigrants on Twitter from a multilingual perspective, English and Spanish. We participated in the two subtasks for both languages. In the English language, we obtained the 18th position in the ranking of Task A and the 19th position in Task B. In the Spanish language, we obtained the 24th position in the ranking for both tasks.
The success of deep learning depends on finding an architecture that fits the task. Furthermore, as deep learning has scaled up to more challenging tasks, architectures have become difficult to design by hand. In this paper, a CNN was implemented based on the architecture proposed by Zhang and Wallace (2015), and no fine-tuning of hyperparameters was done for the proposed tasks (Tasks A and B). In addition, other features inherent in this type of domain, such as sarcasm and irony, were not exploited. We intend to explore these and other features in future work.
Another point of discussion is why the best performance was obtained in Spanish. The main hypothesis is related to the nature of the corpus used. The Spanish test set is smaller than the English one, and it is also a corpus with "simpler texts to be classified" (Spanish texts have few signs of sarcasm). Such analyses need further study and will be carried out in future work.
In future work, it would also be interesting to explore systems that use different parameters for the CNN and other word embeddings, such as Word2Vec (Mikolov et al., 2013b). It would also be interesting to build an LSTM with the attention mechanism proposed by Lin et al. (2017) and compare its performance.