FERMI at SemEval-2019 Task 5: Using Sentence embeddings to Identify Hate Speech Against Immigrants and Women in Twitter

This paper describes our system (Fermi) for Task 5 of SemEval-2019: HatEval: Multilingual Detection of Hate Speech Against Immigrants and Women on Twitter. We participated in Subtask A for English and ranked first in the evaluation on the test set. We evaluate the quality of multiple sentence embeddings and explore multiple training models to assess the performance of simple yet effective combinations of embeddings and machine learning algorithms. Our model achieved an accuracy of 65.00% for English in Subtask A. Our models, which use pretrained Universal Encoder sentence embeddings to transform the input and an SVM (with RBF kernel) for classification, took first position (among 68) on the test set leaderboard for Subtask A in English. In this paper we provide a detailed description of the approach, as well as the results obtained in the task.


Introduction
Microblogging platforms like Twitter provide channels to exchange ideas using short messages called tweets. While such a platform can be used to share constructive ideas, a small group of people can also propagate their views, including hatred against an individual, a group or a race, to the entire world in a few seconds. This calls for computational methods to identify hate speech in user-generated content.
Using computational methods to identify offense, aggression and hate speech in user-generated content has been gaining attention in recent years, as evidenced by (Waseem et al., 2017; Malmasi and Zampieri, 2017; Kumar et al., 2018) and by workshops such as the Abusive Language Workshop (ALW, https://sites.google.com/view/alw2018) and the Workshop on Trolling, Aggression and Cyberbullying (TRAC).

Related Work
In this section we briefly describe other work in this area.
A few of the early works on hate speech detection used features such as bag of words and word and character n-grams with relatively off-the-shelf machine learning classifiers (Dinakar et al., 2011; Waseem and Hovy, 2016; Nobata et al., 2016). Deep learning methods for hate speech detection were used by Badjatiya et al. (2017), who experimented with a combination of multiple deep learning architectures along with randomly initialized word embeddings learned by Long Short Term Memory (LSTM) models.
Papers published in the last two years include the surveys by Schmidt and Wiegand (2017) and Fortuna and Nunes (2018), the paper that presented the Hate Speech Detection dataset used in (Malmasi and Zampieri, 2017), and a few other recent papers such as (ElSherief et al., 2018; Gambäck and Sikdar, 2017; Zhang et al., 2018).
A typology of abusive language sub-tasks is proposed in (Waseem et al., 2017). For studies on languages other than English, see (Su et al., 2017) on Chinese and (Fišer et al., 2017) on Slovene. Finally, recent work on identifying profanity versus hate speech highlighted the challenges of distinguishing between profanity and threatening language, which may not actually contain profane language.
Similar and related previous workshops include Text Analytics for Cybersecurity and Online Safety (TA-COS), the Abusive Language Workshop, and TRAC. Related shared tasks include GermEval (Wiegand et al., 2018) and TRAC (Kumar et al., 2018).

Methodology
In this paper, we make use of several word embedding and sentence embedding methods.

Word Embeddings
Word embeddings have been widely used in modern Natural Language Processing applications as they provide vector representation of words. They capture the semantic properties of words and the linguistic relationship between them. These word embeddings have improved the performance of many downstream tasks across many domains like text classification, machine comprehension etc. (Camacho-Collados and Pilehvar, 2018). Multiple ways of generating word embeddings exist, such as Neural Probabilistic Language Model (Bengio et al., 2003), Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and more recently ELMo (Peters et al., 2018).
These word embeddings rely on the distributional linguistic hypothesis. They differ in the way they capture the meaning of the words or the way they are trained. Each word embedding captures a different set of semantic attributes which may or may not be captured by other word embeddings. In general, it is difficult to predict the relative performance of these word embeddings on downstream tasks. The choice of which word embeddings should be used for a given downstream task depends on experimentation and evaluation.

Sentence Embeddings
While word embeddings can produce representations for words which can capture the linguistic properties and the semantics of the words, the idea of representing sentences as vectors is an important and open research problem (Conneau et al., 2017).
Finding a universal representation of a sentence that works across a variety of downstream tasks is the major goal of many sentence embedding techniques. A common, naive approach to obtaining a sentence representation from word embeddings is to take the arithmetic mean of the embeddings of all words in the sentence. Smooth inverse frequency, which uses a weighted average and modifies it using Singular Value Decomposition (SVD), is a strong baseline that improves on plain averaging (Arora et al., 2016). Other sentence embedding techniques include p-means (Rücklé et al., 2018), InferSent (Conneau et al., 2017), SkipThought (Kiros et al., 2015) and Universal Encoder (Cer et al., 2018).
Task A (Hate Speech Detection) is a two-class classification task in which systems have to predict whether a tweet in English or in Spanish with a given target (women or immigrants) is hateful or not hateful. Task B (Aggressive Behavior and Target Classification) requires systems first to classify hateful tweets (e.g., tweets where hate speech against women or immigrants has been identified) as aggressive or not aggressive, and second to identify the target harassed as individual or generic (i.e., a single human or a group).
We formulate Subtask A of HatEval as a text classification task. In this paper, we evaluate various pre-trained sentence embeddings for identifying offense, hate and aggression. We train multiple models using different machine learning algorithms to evaluate the efficacy of each of the pretrained sentence embeddings for the downstream task. We observe that there is a class label imbalance in the dataset. To prevent any bias induced by the imbalanced classes, we process the transformed training dataset using SMOTE (Chawla et al., 2002), which synthetically oversamples the data and ensures that all classes have the same number of instances.
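The oversampling step can be illustrated with a minimal pure-Python sketch of the SMOTE idea: a synthetic point is placed on the line segment between a sample of the smaller class and one of its nearest same-class neighbours. The function names (`smote_oversample`, `balance`), the choice of `k`, and the toy data are assumptions made for illustration; the paper does not state which implementation was used, and libraries such as imbalanced-learn provide a full version.

```python
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours.
    Assumes the class has at least two samples."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest same-class neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def balance(X, y, k=5, seed=0):
    """Oversample every class up to the size of the largest class."""
    by_label = {}
    for features, label in zip(X, y):
        by_label.setdefault(label, []).append(features)
    target = max(len(rows) for rows in by_label.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_label.items():
        extra = smote_oversample(rows, target - len(rows), k=k, seed=seed)
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out


# Toy 2-dimensional "embeddings": four not-hateful (0) and two hateful (1) rows.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [1.0, 1.0], [1.2, 0.9]]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = balance(X, y)
print(y_bal.count(0), y_bal.count(1))
```

Oversampling is applied to the training portion only, after the text has been transformed into embeddings, so that the dev and test distributions are left untouched.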
In the following, we discuss various popular sentence embedding methods in brief.
• InferSent (Conneau et al., 2017) is a set of embeddings proposed by Facebook. InferSent embeddings have been trained using a popular language inference corpus. Given two sentences, the model is trained to infer whether they are a contradiction, a neutral pairing, or an entailment. The output is an embedding of 4096 dimensions.
• Concatenated Power Mean Word Embedding (Rücklé et al., 2018) generalizes the concept of average word embeddings to power mean word embeddings. The concatenation of different types of power mean word embeddings considerably closes the gap to state-of-the-art methods monolingually and substantially outperforms many complex techniques cross-lingually.
• Lexical Vectors (Salle and Villavicencio, 2018) is a word embedding similar to fastText with a slightly modified objective. FastText (Bojanowski et al., 2016) is a word embedding model that incorporates character n-grams into the skip-gram model of Word2Vec and so takes sub-word information into account.
• The Universal Sentence Encoder (Cer et al., 2018) encodes text into high dimensional vectors. The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length English text and the output is a 512 dimensional vector.
• Deep Contextualized Word Representations (ELMo) (Peters et al., 2018) use language models to get the embeddings for individual words. The entire sentence or paragraph is taken into consideration while calculating these embedding representations. ELMo uses a pre-trained bi-directional LSTM language model. For the input supplied, the ELMo architecture extracts the hidden state of each layer. A weighted sum is computed of the hidden states to obtain an embedding for each sentence.
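As an illustration of the power mean idea in the list above, the following sketch builds a sentence vector by concatenating several element-wise power means of the word vectors. With p = 1 the power mean is the plain arithmetic mean, and p = +∞ and p = −∞ give the element-wise maximum and minimum. The function names, the default set of p values, and the toy vectors are assumptions for illustration, not the configuration used in our experiments.

```python
def power_mean(vectors, p):
    """Element-wise power mean of a list of equal-length word vectors.

    p = 1 is the arithmetic mean; p = +inf / -inf are the max / min.
    Other finite p values assume non-negative entries, since a negative
    base with a fractional exponent is not a real number.
    """
    if p == float("inf"):
        return [max(col) for col in zip(*vectors)]
    if p == float("-inf"):
        return [min(col) for col in zip(*vectors)]
    n = len(vectors)
    return [(sum(x ** p for x in col) / n) ** (1.0 / p) for col in zip(*vectors)]


def concat_power_means(vectors, ps=(1.0, float("inf"), float("-inf"))):
    """Concatenate one power mean per value of p into a single sentence vector."""
    sentence_vector = []
    for p in ps:
        sentence_vector.extend(power_mean(vectors, p))
    return sentence_vector


# Two toy 2-dimensional word vectors standing in for a two-word sentence.
words = [[1.0, -2.0], [3.0, 4.0]]
print(concat_power_means(words))  # mean, then max, then min
```

The resulting vector has one block per value of p, so its dimensionality is the word embedding size times the number of power means used.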
Using each of the sentence embeddings we have mentioned above, we seek to evaluate how each of them performs when the vector representations are supplied for classification with various off-the-shelf machine learning algorithms. For each of the evaluation tasks, we perform experiments using each of the sentence embeddings mentioned above and show our classification performance on the dev set given by the task organizers.

Dataset
The data collection methods used to compile the HatEval dataset are described in (Basile et al., 2019). We did not use any external datasets to augment the data for training our models.

Results and Analysis
The official test set results scored on CodaLab have been presented below in Table 2.

Table 2: Official test set results for Task A (English).

System               F1 (macro)   Accuracy
Universal Encoder    0.65         0.65

Our results for the different algorithms described above are reported in Table 1.
Table 1 gives the dev set macro-averaged F1 and accuracy for Task A in English.
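For readers unfamiliar with the two reported metrics, the sketch below computes accuracy and macro-averaged F1 from gold and predicted labels. Macro-averaging computes F1 separately for each class and takes the unweighted mean, so the smaller class counts as much as the larger one. The function names and the toy labels are illustrative assumptions and are not taken from the task data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: 1 = hateful, 0 = not hateful.
gold = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 0, 0, 0, 0, 0, 0, 1]
print(accuracy(gold, pred), macro_f1(gold, pred))
```

On imbalanced data the two numbers can diverge, which is why both are reported.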
The best performance for Task A in English on the official test set was achieved by the model that used pretrained Universal Encoder sentence embeddings with an SVM with RBF kernel. However, pretrained InferSent embeddings combined with the XGBoost algorithm outperformed every other combination on the dev set. This is probably due to the difference between the distributions of the dev set and the official test set.
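An SVM with an RBF kernel compares two sentence embeddings x and z through k(x, z) = exp(-gamma * ||x - z||^2), so similarity decays with the squared Euclidean distance between the embeddings. The sketch below only illustrates this kernel function and the resulting Gram matrix; it is not the SVM training procedure, and the value of `gamma` and the toy vectors are assumptions (in practice the value comes from the library's defaults or from tuning).

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF (Gaussian) kernel between two vectors of equal length."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def gram_matrix(X, gamma):
    """Pairwise kernel values for a list of vectors."""
    return [[rbf_kernel(a, b, gamma) for b in X] for a in X]


# Toy 2-dimensional vectors standing in for 512-dimensional sentence embeddings.
embeddings = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0]]
for row in gram_matrix(embeddings, gamma=0.5):
    print([round(value, 3) for value in row])
```

Identical embeddings have kernel value 1, and the value approaches 0 as embeddings move apart; a larger `gamma` makes the decay faster.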
Overall, this work shows that different sets of pretrained embeddings, produced by different state-of-the-art architectures and methods, perform well when combined with simple machine learning classifiers on the task of categorizing text as hateful or not.

Conclusions and Future Work
It is also important to note that the experiments were performed using default parameters, so there is considerable scope for improvement through fine-tuning, which we plan to consider in future research. Further, we can explore augmenting the training data with data from other similar shared tasks to achieve better performance.