Tw-StAR at SemEval-2019 Task 5: N-gram embeddings for Hate Speech Detection in Multilingual Tweets

In this paper, we describe our contribution to SemEval-2019: subtask A of task 5, “Multilingual detection of hate speech against immigrants and women in Twitter (HatEval)”. We developed two hate speech detection model variants through the Tw-StAR framework. While the first model used one-hot encoded n-grams to train an NB classifier, the second generated and learned n-gram embeddings within a feedforward neural network. For both models, specific terms, selected via MWT patterns, were tagged in the input data. With two feature types employed, we could investigate the ability of n-gram embeddings to rival one-hot n-grams. Our results showed that in English, n-gram embeddings outperformed one-hot n-grams. However, representing Spanish tweets by one-hot n-grams yielded slightly better performance than n-gram embeddings. The official ranking indicated that Tw-StAR ranked 9th for English and 20th for Spanish.


Introduction
Under the guise of free speech, social media systems have been misused by some users who embed hateful, offensive, racist or negatively stereotyping content within their shared posts. Unfortunately, online Hate Speech (HS) is spreading widely, forming a serious problem that can lead to actual hate crimes (Matsuda, 2018). Many countries have adopted laws prohibiting HS, under which people convicted of using HS can face large fines and even imprisonment. Although Twitter has its own anti-HS policy, the growing volume of daily-shared tweets, together with multilingualism and informal writing, creates the need for automatic HS detection in tweets.
The hate speech detection problem has been addressed as a machine learning classification task. Recent studies proposed multiple HS detection models with different characteristics in terms of features, classification algorithms and implementation architectures. While some HS models employed hand-crafted features generated by NLP tools and external semantic resources, other models adopted text embedding features that are automatically learned from the corpus itself. Both feature types were used to train either traditional classifiers such as Support Vector Machines (SVM) and Naive Bayes (NB), or more complicated deep learning-based classifiers such as Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNN) (Schmidt and Wiegand, 2017). The variety of hand-crafted features made it possible to obtain reliable performance. However, generating such features with morphological NLP tools or semantic resources remains laborious. In contrast, embedding features are easier to generate and can yield good HS classification results when used within deep learning architectures (Yuan et al., 2016). Nevertheless, achieving good performance with deep neural systems requires large labeled training datasets, the tuning of many hyperparameters, and a high computation/time cost. In line with the Tw-StAR framework (Mulki et al., 2017, 2018a), we propose here an HS model based on the hypothesis that pairing n-gram embeddings with less complicated architectures, i.e. a feedforward neural network, can lead to efficient HS detection with minimal complexity.

Hand-Crafted-based Models
Being user-generated content, HS terms tend to appear in variant written forms. Waseem and Hovy (2016) handled this issue by using character n-grams to train an LR classifier. Combining character n-grams with extra linguistic features such as word n-grams and the user's gender improved the performance.
Additional user-related features were studied by Unsvåg and Gambäck (2018) within a multilingual HS detection task. Single and combined features were fed into an LR classifier. The study showed that specific user features favorably impact the performance.
The winning system (Pamungkas et al., 2018) in the misogyny detection contest (Fersini et al., 2018) examined several sets of hand-crafted features, including stylistic, lexical and structural ones. The features were formulated as one-hot/sparse encoding vectors and fed into an SVM classifier. It was noted that using features from the HurtLex lexicon (Bassignana et al., 2018) enriched the lexical feature set and enhanced the performance.

Text Embeddings-based Models
In these models, the input text is represented using dense, low-dimensional and real-valued vectors. Nobata et al. (2016) conducted a comprehensive comparison among three embedding feature types, doc2vec (Le and Mikolov, 2014), word2vec and pretrained word embeddings, against hand-crafted features. Using a regression model trained with these features, they noted that while doc2vec embeddings outperformed the other embedding features, combining them with all other features could further enhance the recognition of HS content. Badjatiya et al. (2017) explored CNN, LSTM and FastText models to learn the embedding features needed to classify HS content. These models were trained with embedding features and evaluated against each other and against SVM, LR and GBDT classifiers trained with hand-crafted features. Moreover, the authors explored training a GBDT classifier with word embeddings learned via various deep models. While CNN was the best-performing deep model, using the word embeddings learned via LSTM to train the simpler GBDT classifier improved the results.
Gambäck and Sikdar (2017) used context-aware word embeddings learned by word2vec, character 4-grams and a combination of both to train a CNN-based classifier. The proposed model was compared with an LR classifier trained via n-gram features (Waseem and Hovy, 2016). The results showed that, regardless of the embedding type used, the CNN model outperformed the baseline model. Moreover, word2vec embeddings achieved the best classification performance among the embedding features.

Tw-StAR HS Detection Model
To detect HS in the English and Spanish datasets provided by Basile et al. (2019), Tw-StAR (see Figure 1) was applied through the following steps:
• Stopwords removal: for English and Spanish, we removed stopwords using 1,012 English stopwords and 731 Spanish stopwords derived from the Terrier package † and Snowball ‡ , respectively.
• Lemmatization: we adopted the TreeTagger lemmatizer (Schmid, 1999), as it was used successfully for English and Spanish in (Mulki et al., 2018a). TreeTagger is a language-independent tool for annotating texts with part-of-speech and lemma information.
• Hate indicatives tagging: Multi-word terms (MWT) are indicators of the meaning of a sentence/document (Henry et al., 2018; Bechikh-Ali et al., 2019). In our case, they can represent the entities discussed within a tweet. As our objective is to infer HS in tweets, we believe that recognizing MWT can assist in identifying the important entities related to hate speech or to victims of hate speech. This was observed in practice among the MWT extracted from the training set, for example: african migrant, Iraqi refugee terrorist, Muslim refugee, immigration negative effect. It should be noted that MWT were extracted only from the hate tweets contained in the training set. Later, the extracted MWT were replaced in both the training and dev/test sets with the tag "HateWord". The MWT identification process was performed in two steps: (a) shallow syntactic parsing, where each word was tagged with its syntactic category using TreeTagger, which supports English and Spanish, and (b) MWT extraction based on specific syntactic patterns of noun and adjective combinations, following a schema in which * denotes a list of 0 or more elements and the MWT length varies between 2 and 4 words. Adj, N and NP refer to adjective, noun and proper noun, respectively.
Figure 1: Tw-StAR framework
† https://bitbucket.org/kganes2/
‡ http://snowball.tartarus.org/
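The "HateWord" replacement step described above can be illustrated with a minimal sketch. It assumes the MWT list has already been extracted and that tweets are whitespace-tokenized; the function name `tag_hate_indicatives`, the case-insensitive matching and the longest-match-first rule are assumptions made for illustration, not details confirmed by the paper.

```python
def tag_hate_indicatives(tokens, mwt_list, tag="HateWord"):
    """Replace each occurrence of a known multi-word term (MWT) with one tag.

    Assumptions (not from the paper): matching is case-insensitive and
    longer MWTs are tried before shorter ones.
    """
    # Turn each MWT into a tuple of lowercase words; skip empty entries.
    patterns = [tuple(t.lower().split()) for t in mwt_list if t.split()]
    # Longest patterns first, so a 3-word term wins over a nested 2-word term.
    patterns.sort(key=len, reverse=True)

    lowered = [t.lower() for t in tokens]
    out, i = [], 0
    while i < len(tokens):
        for p in patterns:
            if tuple(lowered[i:i + len(p)]) == p:
                out.append(tag)      # collapse the whole term into one tag
                i += len(p)
                break
        else:
            out.append(tokens[i])    # no MWT starts here: keep the token
            i += 1
    return out


# MWT examples quoted in the text above.
mwts = ["african migrant", "Iraqi refugee terrorist",
        "Muslim refugee", "immigration negative effect"]
tweet = "the iraqi refugee terrorist story about muslim refugee camp".split()
tagged = tag_hate_indicatives(tweet, mwts)
print(tagged)
```

Collapsing each term into a single token means that both feature types described later see the same "HateWord" symbol regardless of which MWT appeared in the tweet.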

Feature Extraction
Two types of features were generated to train both model variants of Tw-StAR.
• One-hot n-grams: these are generated by tokenizing the preprocessed tweets. Three n-gram schemes, namely unigrams, bigrams and trigrams, were adopted. For a given n-gram scheme, a tweet's feature vector is constructed by checking the presence/absence of each n-gram of that scheme among the tweet's tokens. Thus, the feature vectors are formulated as one-hot encoding vectors with binary values "1" (presence) or "0" (absence). The term frequency (TF) property was employed to reduce the feature set size according to predefined frequency thresholds.
• N-gram embeddings: based on word embeddings initialized randomly at the embedding layer of the Tw-StAR feedforward neural model, n-gram embeddings are produced by applying a composition function over a specific number of word embedding vectors. In our experiments, we used the additive composition function, known as Sum Of Word Embeddings (SOWE). When composing an n-gram embedding vector by performing an element-wise sum over word embedding vectors, SOWE considers the co-occurrence information of the n-gram words and totally ignores the local word order.
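The two feature types above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper names are hypothetical, the TF threshold is assumed to mean "keep n-grams whose corpus frequency is at least the threshold", and the word vectors passed to `sowe` are hand-picked toy values rather than learned embeddings.

```python
from collections import Counter


def ngrams(tokens, n):
    """Contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract(tokens, max_n):
    """All n-grams of order 1..max_n (max_n=2 gives the uni+bi scheme)."""
    return [g for n in range(1, max_n + 1) for g in ngrams(tokens, n)]


def build_vocabulary(corpus, max_n=2, min_tf=2):
    """Keep n-grams seen at least min_tf times (assumed TF-threshold rule)."""
    tf = Counter(g for tokens in corpus for g in extract(tokens, max_n))
    return sorted(g for g, count in tf.items() if count >= min_tf)


def one_hot(tokens, vocabulary, max_n=2):
    """Binary presence (1) / absence (0) vector over the vocabulary."""
    present = set(extract(tokens, max_n))
    return [1 if g in present else 0 for g in vocabulary]


def sowe(word_vectors):
    """Sum Of Word Embeddings: element-wise sum, so word order is ignored."""
    return [sum(dims) for dims in zip(*word_vectors)]


corpus = [["HateWord", "go", "home"], ["go", "home", "now"], ["nice", "day"]]
vocab = build_vocabulary(corpus, max_n=2, min_tf=2)
features = one_hot(["go", "away"], vocab)

# Toy 2-dimensional word vectors; swapping them gives the same n-gram vector.
v1, v2 = [1.0, 2.0], [0.5, -1.0]
print(vocab, features, sowe([v1, v2]), sowe([v2, v1]))
```

The last line makes the order-invariance of SOWE concrete: the n-grams "w1 w2" and "w2 w1" receive identical embeddings because the sum is commutative.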

Hate Speech Classification
Using the generated features, a Naive Bayes (NB) classifier and a feedforward neural network model were trained:
• Naive Bayes model: with one-hot n-gram features, we used an NB classifier implemented as a multinomial NB decision rule together with binary-valued features.
• Feedforward neural network: this model was developed with the following layers:
- Embeddings layer: receives the n-grams generated for each input tweet and maps their constituent words into their corresponding dense word representations. N-grams are produced by going through the tweet using a sliding window of a fixed size (N) such that each word of the tweet is considered. All the resulting n-grams (shingles) are then fed to the model with supervision information included, where each n-gram is associated with a 2-dimensional label, HS [1,0] or NOT [0,1], that represents the polarity of the tweet from which the n-gram is derived.
- Lambda layer: composes n-gram embeddings by applying SOWE over the word embeddings resulting from the embedding layer.
- Hidden layer: introduces non-linear discriminating features to the model with a ReLU activation function.
- Output layer: is equipped with a softmax function to produce the estimated probabilities of each n-gram output label (HS/NOT).
Considering the whole tweet, the HS scores and NOT scores predicted for all n-grams of the tweet are summed, and each sum is then divided by the number of n-grams contained in the tweet, yielding two values for the potential HS and NOT scores of the tweet. The label of the tweet is thus decided according to the greater of these two values.
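The sliding-window shingling and the tweet-level decision rule described above can be sketched as follows. The per-n-gram probabilities are assumed to come from the softmax output layer and are supplied here as toy values. The function names, the handling of tweets shorter than the window (a single shorter n-gram) and the tie-breaking toward NOT are assumptions, since the text does not specify them.

```python
def sliding_ngrams(tokens, n):
    """Shingle a tweet with a fixed-size window so every word is covered.

    Assumption: a tweet with at most n tokens yields one (shorter) n-gram.
    """
    if len(tokens) <= n:
        return [list(tokens)]
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def classify_tweet(ngram_probs):
    """Aggregate per-n-gram (p_hs, p_not) pairs into a tweet-level label.

    Scores are summed over the tweet's n-grams and divided by their number;
    the larger average decides the label (ties go to NOT by assumption).
    """
    if not ngram_probs:
        raise ValueError("a tweet must contain at least one n-gram")
    count = len(ngram_probs)
    hs_score = sum(p_hs for p_hs, _ in ngram_probs) / count
    not_score = sum(p_not for _, p_not in ngram_probs) / count
    label = "HS" if hs_score > not_score else "NOT"
    return label, hs_score, not_score


shingles = sliding_ngrams("a b c d e".split(), 3)
# Toy softmax outputs for three n-grams of one tweet: (P(HS), P(NOT)).
label, hs_score, not_score = classify_tweet([(0.9, 0.1), (0.6, 0.4), (0.2, 0.8)])
print(shingles, label, round(hs_score, 3), round(not_score, 3))
```

Averaging rather than taking a majority vote lets a few confidently hateful n-grams outweigh several weakly neutral ones, as in the toy example, where two of the three n-grams lean HS and the average follows them.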

Results and Discussion
Having the data preprocessed and hate indicatives specified and tagged in both training and dev/test sets, two HS models were used.
The first model is an NB classifier from NLTK trained with one-hot n-gram features. We generated three n-gram schemes: unigrams (uni), unigrams+bigrams (uni+bi) and unigrams+bigrams+trigrams (uni+bi+tri). NB was first trained using all n-gram features, then with a reduced number of features obtained via term frequency (TF) with two thresholds: 2 and 3. Among several runs with various n-gram schemes and TF values, we adopted the best-performing configuration: the uni+bi scheme with a TF threshold of 2.
The second model combines n-gram embeddings within a feedforward neural network. A window size of 8 was selected empirically to produce 8-gram embeddings. Similarly, the embedding dimension was set to 100. For training, the backpropagation algorithm and the "Adam" optimizer (Kingma and Ba, 2014) were used. Table 1 lists the results obtained using the Train and Dev sets of English and Spanish tweets, where the language, embeddings, average recall, average f-measure and accuracy are referred to as (Lang.), (emb.), (R.), (F1) and (Acc.), respectively.
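For readers who want to reproduce the reported measures, the sketch below computes average recall, average F1 and accuracy for a two-class task. It assumes that "average" means an unweighted (macro) average over the HS and NOT classes, which the text does not state explicitly; the function name and the toy labels are hypothetical.

```python
def macro_scores(gold, pred, labels=("HS", "NOT")):
    """Return (macro recall, macro F1, accuracy) for parallel label lists.

    Assumption: the averages are unweighted means over the classes.
    """
    recalls, f1s = [], []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return sum(recalls) / len(labels), sum(f1s) / len(labels), accuracy


# Toy example, not data from the paper.
gold = ["HS", "HS", "NOT", "NOT"]
pred = ["HS", "NOT", "NOT", "NOT"]
macro_r, macro_f1, acc = macro_scores(gold, pred)
print(round(macro_r, 3), round(macro_f1, 3), round(acc, 3))
```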
Considering Table 1, both feature types performed well for HS detection in English. However, n-gram embeddings were better, with an F1 of 94% against 87% scored by one-hot n-grams. We explain this by the ability of n-gram embeddings to capture semantic word regularities regardless of the local word order, which suits the informal English used on Twitter, where varying word orders can convey the same semantics (Iyyer et al., 2015).
Regarding the Spanish dataset, while the HS classification performances produced by both feature types were quite comparable, one-hot n-grams achieved slightly better results, with an F1 of 77% and an accuracy of 78% compared to 72% and 72% scored by n-gram embeddings, respectively. This could be attributed to the differences in vocabulary introduced by the different spoken varieties of Spanish found in the tweets (Maier and Gómez-Rodríguez, 2014). Hence, SOWE may miss the synonymy and semantic relations among such different words having the same or close semantics, which, in turn, leads to less expressive n-gram embeddings.
Having identified the best-performing features for English and Spanish, we adopted one-hot n-grams for Spanish and n-gram embeddings for English in the official submission. Table 2 lists the official results of Tw-StAR against the top three ranking systems, where (L.), (Acc.), (Eng.), (Sp.), (R.) and (F1) refer to language, accuracy, English, Spanish, recall and f-measure, respectively.
Considering Table 1 and Table 2, we observe that Tw-StAR exhibits robust performance on the Spanish dataset, while the evaluation measures degraded on the English dataset. We believe that this could be attributed to the lack of homogeneity between the train/dev and test sets of the English data.

Conclusion
We developed two HS detection models for multilingual tweets. With two feature types used, we investigated how well n-gram embeddings can rival one-hot n-grams in HS detection. Upon training NB and a feedforward neural network with one-hot n-grams and n-gram embeddings, respectively, n-gram embeddings exhibited better performance in English, while the vocabulary differences in Spanish made n-gram embeddings less expressive. For future work, we aim to target HS in underrepresented languages such as Arabic and Turkish.