Grunn2019 at SemEval-2019 Task 5: Shared Task on Multilingual Detection of Hate

Hate speech occurs more often than ever and polarizes society. To help counter this polarization, SemEval 2019 organizes a shared task called the Multilingual Detection of Hate. The first task (A) is to decide whether a given tweet contains hate against immigrants or women, in a multilingual perspective, for English and Spanish. In the second task (B), the system is additionally asked to classify hateful tweets as aggressive or not aggressive, and to identify the harassed target as individual or generic. We evaluate multiple models and finally combine them in an ensemble. This ensemble consists of five submodels for the English task and three for the Spanish task. In the current setup, the larger ensemble delivers mediocre performance on English tweets, while the slightly smaller ensemble works well for detecting hate speech in Spanish tweets. Our results on the test set for English are 0.378 macro F1 on task A and 0.553 macro F1 on task B. For Spanish the results are considerably higher: 0.701 macro F1 on task A and 0.734 macro F1 on task B.


Introduction
The increasing popularity of social media platforms such as Twitter for both personal and political communication has been accompanied by a well-acknowledged rise in toxic and abusive speech on these platforms (Kshirsagar et al., 2018). Although the terms of service of these platforms typically forbid hateful and harassing speech, the volume of data requires automatic ways to classify online content. The problem of detecting hate speech, and thereby possibly limiting its diffusion, is becoming fundamental (Nobata et al., 2016).
Previous work on hate speech against immigrants and women includes Olteanu et al. (2018), who observed that extremist violence tends to lead to an increase in online hate speech, particularly in messages directly advocating violence. Anzovino et al. (2018) contributed to the field by (1) building a corpus of misogynous tweets, labelled from different perspectives, and (2) conducting an exploratory investigation of NLP features and ML models for detecting and classifying misogynistic language. Basile et al. (2019) proposed a shared task on the Multilingual Detection of Hate, where participants have to detect hate speech against immigrants and women on Twitter, in a multilingual perspective, for English and Spanish. The task is divided into two related subtasks for both languages. The first is a basic task about hate speech. The second investigates fine-grained features of hateful content in order to understand how existing approaches deal with especially dangerous forms of hate, for example those where the incitement is against an individual rather than a group of people, and where aggressive behavior of the author is a prominent feature of the expression of hate.
Within this experiment, Task A is a binary classification task in which our system has to predict whether a tweet is hateful or not hateful. For Task B, our system has to decide whether a tweet is aggressive or not aggressive, and whether it targets an individual or a generic group, that is, a single person or a group of people.
The paper is structured as follows. Section 2 describes our system setup. Section 3 presents the datasets together with the preprocessing steps. Section 4 details the obtained results. Finally, Section 5 discusses the proposed system.

System Setup
In our approach, we trained multiple classifiers and combined their results into an ensemble model using majority vote.
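The majority vote mentioned above can be illustrated with a minimal sketch. The function name and the toy predictions are hypothetical; the sketch only assumes that each classifier outputs one binary label per tweet and that the number of classifiers is odd, as in both ensembles.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists (one list per model) into one label per tweet."""
    n_items = len(predictions[0])
    if any(len(p) != n_items for p in predictions):
        raise ValueError("all models must predict the same number of tweets")
    combined = []
    for votes in zip(*predictions):
        # With an odd number of binary voters there is always a strict majority.
        label, _ = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical outputs of three classifiers on four tweets.
rf = [1, 0, 1, 0]
svm = [1, 1, 0, 0]
lr = [0, 1, 1, 0]
print(majority_vote([rf, svm, lr]))  # [1, 1, 1, 0]
```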

English Ensemble Setup
The setup of our system optimized for the English classification tasks consisted of the following classifiers:
• Random Forest
• Support Vector Machine (1)
• Support Vector Machine (2)
• Logistic Regression
• BiLSTM

Spanish Ensemble Setup
Due to time restrictions, we used three classifiers for the Spanish tasks:
• Random Forest
• Support Vector Machine (1)
• Logistic Regression
These time restrictions arose because we decided at the last moment to run our system on the Spanish task as well. We had neither hate-speech-specific word embeddings nor a trained BiLSTM model for Spanish. Therefore, we decided to run only three classifiers for both Spanish tasks.

Random Forest (RF)
For our RF model we ran a grid search over the following tf-idf features: character n-grams with ranges 2-3, 2-4, 1-3 and 1-4; word n-grams with ranges 1, 1-2, 1-3 and 1-4; and all combinations of them. In the end we used a tf-idf vectorizer with character n-grams in the range 2-4. The classifier parameters, also selected by grid search, were 400 estimators, entropy as the split criterion, balanced class weights, and a random seed of 1337. Due to time restrictions, we used the same parameters for the Spanish tasks.
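As a rough sketch, the final RF configuration described above could be expressed with scikit-learn as follows. The choice of `analyzer="char"` (rather than `"char_wb"`), the pipeline layout, and the toy tweets are assumptions made for illustration; only the n-gram range and the four classifier parameters come from the text.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def build_rf_pipeline():
    """Tf-idf over character 2- to 4-grams followed by a random forest."""
    # Assumption: plain character n-grams; the paper does not name the analyzer.
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
    forest = RandomForestClassifier(
        n_estimators=400,         # reported number of estimators
        criterion="entropy",      # reported split criterion
        class_weight="balanced",  # reported class weighting
        random_state=1337,        # reported random seed
    )
    return make_pipeline(vectorizer, forest)


# Hypothetical toy data, only to show that the pipeline fits and predicts.
toy_tweets = [
    "have a nice day everyone",
    "thanks for the kind words",
    "placeholder for an abusive message",
    "another placeholder abusive message",
]
toy_labels = [0, 0, 1, 1]

model = build_rf_pipeline()
model.fit(toy_tweets, toy_labels)
predictions = model.predict(["a kind message for everyone"])
```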

Support Vector Machine (SVM 1)
For this part of our ensemble model, we used an SVM model from the scikit-learn library (Pedregosa et al., 2011) with a linear kernel and weighted class weights. As input, this model used character n-grams in the range 2-4, vectorized with a tf-idf vectorizer.

Support Vector Machine (SVM 2)
We used a second SVM classifier within our ensemble model, but this time with word embeddings as its input. This choice is motivated by the hypothesis that combining predictions from models trained on different representations could add complementary information. We tested four pre-trained embedding representations: the 300-dimensional GloVe embeddings and the 25-dimensional GloVe Twitter representations by Pennington et al. (2014), and a 400-dimensional and a 100-dimensional word embedding created from tweets (Van der Goot). The GloVe embeddings proved to be superior in our experiments. The results for each word embedding can be found in Table 6.
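The text does not state how word vectors were turned into a single feature vector per tweet for SVM 2. One common choice, assumed here purely for illustration, is to average the embeddings of the tokens that have a vector. The function name and the three-dimensional toy embeddings are hypothetical.

```python
import numpy as np


def tweet_vector(tokens, embeddings, dim):
    """Average the vectors of all known tokens; zeros if none is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # Assumption: tweets without any known token map to the zero vector.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Hypothetical 3-dimensional embeddings standing in for GloVe vectors.
toy_embeddings = {
    "good": np.array([1.0, 0.0, 0.0]),
    "day": np.array([0.0, 1.0, 0.0]),
}
vec = tweet_vector(["good", "day", "unknownword"], toy_embeddings, dim=3)
empty = tweet_vector(["unknownword"], toy_embeddings, dim=3)
```

The resulting fixed-length vectors can then be passed to any classifier that expects dense numeric input.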

Logistic Regression (LR)
We ran a grid search for our LR model with a tf-idf vectorizer over the following features: character n-grams with ranges 2-3, 2-4, 1-3 and 1-4; word n-grams with ranges 1, 1-2, 1-3 and 1-4; and all combinations of them. We obtained the best performance with a tf-idf vectorizer using character n-grams in the range 2-4. Due to time restrictions, we used the same parameters for the Spanish tasks.
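The search space described above can be enumerated as in the following sketch. It reads "all combinations" as every pairing of one character range with one word range, in addition to each range on its own; this reading, the helper names, and `max_iter=1000` are assumptions, not details taken from the paper.

```python
from itertools import product

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

CHAR_RANGES = [(2, 3), (2, 4), (1, 3), (1, 4)]
WORD_RANGES = [(1, 1), (1, 2), (1, 3), (1, 4)]


def candidate_settings():
    """Character-only, word-only, and combined feature settings."""
    settings = [(c, None) for c in CHAR_RANGES]
    settings += [(None, w) for w in WORD_RANGES]
    settings += list(product(CHAR_RANGES, WORD_RANGES))
    return settings


def build_lr_pipeline(char_range, word_range):
    """Build an LR pipeline for one feature setting (None disables a feature type)."""
    parts = []
    if char_range is not None:
        parts.append(("char", TfidfVectorizer(analyzer="char", ngram_range=char_range)))
    if word_range is not None:
        parts.append(("word", TfidfVectorizer(analyzer="word", ngram_range=word_range)))
    # max_iter is raised only so the toy solver converges; it is not from the paper.
    classifier = LogisticRegression(max_iter=1000)
    return Pipeline([("features", FeatureUnion(parts)), ("clf", classifier)])


settings = candidate_settings()
# The setting the text reports as best: character 2- to 4-grams only.
best_reported = build_lr_pipeline((2, 4), None)
```

Each setting would be scored on held-out data and the best one kept.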

BiLSTM
Our BiLSTM classifier was only optimized for the English classification task. Hence we decided not to use it in our Spanish setup. In combination with the BiLSTM model we used an attention mechanism, as proposed by Yang et al. (2016).
LSTM models process input sequentially and can therefore take word order into account. We combine this with a bidirectional model, which allows us to process the tweets both forwards and backwards. For each word in a tweet, the LSTM model combines its previous hidden state and the current word's embedding to compute a new hidden state. After using dropout to shut down a percentage of the model's neurons, we feed the information to the attention mechanism. This mechanism emphasizes the most informative words in the tweet and gives these more weight.

              Hate Speech      Target Range     Aggressiveness
              0       1        0       1        0       1
Trial data    50      50       87      13       80      20
Train data    5217    3783     7659    1341     7440    1559
Dev data      573     427      781     219      796     204
Test data     3000
Total         13100
Our final model uses 512 units in the hidden layer of the BiLSTM, a batch size of 64, the Adam optimizer with its default learning rate of 0.001, and a dropout of 0.4. We trained our model for 10 epochs and kept the model with the lowest validation loss.
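The attention step can be sketched in NumPy in the style of the word-level attention of Yang et al. (2016): each hidden state is projected, scored against a context vector, and the softmax-normalized scores weight the hidden states. The dimensions and random weights below are hypothetical, and this is not the trained model, only an illustration of the computation.

```python
import numpy as np


def attention_pool(hidden_states, W, b, context):
    """Weight hidden states of shape (timesteps, hidden_dim) by learned relevance."""
    u = np.tanh(hidden_states @ W + b)      # projected states, (timesteps, attn_dim)
    scores = u @ context                    # one relevance score per word
    scores = scores - scores.max()          # shift for a numerically stable softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    summary = weights @ hidden_states       # weighted sum, (hidden_dim,)
    return summary, weights


# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
timesteps, hidden_dim, attn_dim = 6, 8, 4
H = rng.normal(size=(timesteps, hidden_dim))
W = rng.normal(size=(hidden_dim, attn_dim))
b = np.zeros(attn_dim)
context = rng.normal(size=attn_dim)

summary, weights = attention_pool(H, W, b, context)
```

Words with a larger weight contribute more to the tweet representation that is passed on to the output layer.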

Data and Preprocessing
The data distribution for this shared task is shown in Table 1 and Table 2 for the train and development data; we treated the trial data as train data too. After release of the test data, the distribution was 69% train, 8% development, and 23% (3000 sentences) test data for the English task, and 68% train, 8% development, and 24% (1600 sentences) test data for the Spanish task. For the final submission, we trained our system on the combined train and development data.
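The English percentages follow from the reported counts when trial and train data are merged (100 + 9000 train, 1000 development, 3000 test). A small check, with hypothetical variable names:

```python
# English counts as reported: trial data is treated as train data.
english_counts = {"train": 100 + 9000, "dev": 1000, "test": 3000}

total = sum(english_counts.values())
shares = {name: round(100 * n / total) for name, n in english_counts.items()}

print(total)   # 13100
print(shares)  # {'train': 69, 'dev': 8, 'test': 23}
```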
The meaning of the binary encoding is as follows, for Hate Speech (HS) and Aggressiveness (AG): 0 or 1, absent and present respectively. For Target Range (TR): 0 or 1, whole group and individual respectively. We notice that there is more data available for the English task than the Spanish one.
We preprocessed the data as follows:
• Tokenized with the NLTK TweetTokenizer.
• Replaced URLs with a placeholder suitable for our English embeddings.
• Replaced mentions with a placeholder suitable for the available English embeddings (van der Goot and van Noord, 2017).
• Converted words to lowercase.
• Filtered out stopwords using the stopwords from NLTK, either English or Spanish.
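The steps listed above can be sketched as one function. To keep the example self-contained it defaults to whitespace tokenization and a tiny hypothetical stopword set; in practice `nltk.tokenize.TweetTokenizer().tokenize` and the NLTK stopword lists would be passed in instead. The placeholder strings `<URL>` and `<USER>` are assumptions, since the placeholders expected by the embeddings are not given in the text.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Hypothetical stand-in for the NLTK English or Spanish stopword list.
TOY_STOPWORDS = frozenset({"the", "a", "is", "to", "and"})


def preprocess(tweet, tokenize=str.split, stopwords=TOY_STOPWORDS,
               url_token="<URL>", user_token="<USER>"):
    """Tokenize, replace URLs and mentions, lowercase, and drop stopwords."""
    tokens = tokenize(tweet)
    tokens = [url_token if URL_RE.fullmatch(t) else t for t in tokens]
    tokens = [user_token if MENTION_RE.fullmatch(t) else t for t in tokens]
    # Assumption: placeholders keep their casing; all other tokens are lowercased.
    tokens = [t if t in (url_token, user_token) else t.lower() for t in tokens]
    return [t for t in tokens if t not in stopwords]


print(preprocess("@someone The weather IS nice https://example.com"))
# ['<USER>', 'weather', 'nice', '<URL>']
```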
For the BiLSTM, we did not do any preprocessing. We expected that preprocessing might hamper learning, since a BiLSTM often performs well with large amounts of varied data. Without preprocessing less information is lost, which we expected to yield a better-performing system.
We tested how the preprocessing affected our scores; the results are in Table 3 and Table 4. We used the available train and development data for these tests. We started with all preprocessing steps and then cumulatively excluded them one by one, so that in the end only tokenization was left.
Interestingly, the scores of the RF and SVM 1 models are higher for both the English and the Spanish data when we exclude preprocessing steps. For the step of replacing URLs and usernames with placeholders, we expected the scores to be higher if it was excluded: if the same URL or username occurs often in the training set and always corresponds with a hateful or non-hateful message, our system could wrongly classify a comment in the development set containing that same URL or username. The scores also increase when we exclude lowercasing or the removal of single characters in addition to the placeholder steps. However, if we omit only lowercasing or only character removal, the scores do not get better than with all preprocessing. This also explains the higher score with the LR model, although disregarding only the character preprocessing step does not improve its score either.
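The cumulative exclusion scheme can be sketched as a list of configurations, each dropping one more step than the previous one until only tokenization remains. The step names and the order of removal are assumptions; the text does not state in which order steps were excluded.

```python
# Hypothetical step names; the removal order (last step first) is assumed.
STEPS = ["tokenize", "replace_urls", "replace_mentions", "lowercase", "remove_stopwords"]


def cumulative_ablation(steps, always_keep="tokenize"):
    """Return step lists from 'all steps' down to 'tokenization only'."""
    optional = [s for s in steps if s != always_keep]
    configs = []
    for n_removed in range(len(optional) + 1):
        kept = optional[: len(optional) - n_removed]
        configs.append([always_keep] + kept)
    return configs


configs = cumulative_ablation(STEPS)
for config in configs:
    print(config)
```

Each configuration would then be used to preprocess the data before training and scoring a model.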

Results
Table 4: Scores with changes in preprocessing for Spanish. Scores in bold are higher than with all preprocessing for the respective system.
In this section, we report our results on the test set, as well as the results of our ensemble model and individual models on the development set. Our final system for the English task consists of all five models listed in the English Ensemble Setup. Each model outputs either 0 or 1, and a majority vote over these outputs gives the final result. For the Spanish task, the final system contains the three models described in the Spanish Ensemble Setup. The results on the various tasks we participated in are listed in Table 5. For the English task, we achieved a much lower accuracy and macro F1 score than for the Spanish task. Assuming the data has been distributed fairly for both languages, it could be that the quality of the test data is lower than that of the train and development data.
These scores are lower than the results of our individual classifiers on the development set, which are listed in Table 3 and Table 4.
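Since the results are reported as macro F1, a small worked example may help. Macro F1 is the unweighted mean of the per-class F1 scores, so a system that over-predicts one class is penalized on the other. The labels below are hypothetical and the function names are illustrative.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, l) for l in labels) / len(labels)


# Hypothetical example in which the positive class is over-predicted.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 0, 1, 1, 1]
score = macro_f1(y_true, y_pred)
```

Here class 1 has precision 0.6 and recall 1 (F1 = 0.75), class 0 has precision 1 and recall 1/3 (F1 = 0.5), so the macro F1 is 0.625.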

Discussion
We compared multiple classification algorithms and combined them into an ensemble model to get a more robust and accurate system. Initially, our system performed reasonably well on the development set, but on the final test set its performance dropped noticeably. Overall, this drop in performance was to be expected. During the final evaluation, our system predicted over 80% of the test set as hate speech. Looking at the data, we thought a large part of the remaining 20% could also be classified as hate speech. The majority class baseline (Basile et al., 2019) also ranked second for accuracy, supporting our expectations.
From our results we can conclude that using
In the future, we would like to improve the performance of our Spanish model, whose development was cut short due to time restrictions. We would also like to test our models with more high-quality data. It would be interesting to find out whether this helps to improve our models' performance.