Atalaya at SemEval 2019 Task 5: Robust Embeddings for Tweet Classification

In this article, we describe our participation in HatEval, a shared task aimed at the detection of hate speech against immigrants and women. We focused on the Spanish subtasks, building on our previous experience with sentiment analysis in this language. We trained linear classifiers and Recurrent Neural Networks, using classic features such as bag-of-words, bag-of-characters, and word embeddings, as well as recent techniques such as contextualized word representations. In particular, we trained robust task-oriented subword-aware embeddings and computed tweet representations using a weighted-averaging strategy. In the final evaluation, our systems showed competitive results for both Spanish subtasks ES-A and ES-B, achieving the first and fourth places respectively.


Introduction
Hate speech against women, immigrants, and many other groups is a pervasive phenomenon on the Internet. In the early days of the World Wide Web, many academics ventured that prejudices and hatred would be removed from this space by the dissolution of identities (Lévy, 2001; Rheingold, 1993). Twenty years later, we can say that this has not been the case. The prevalence of racism in the "World White Web" has been studied in a number of works (Adams and Roscigno, 2005; Kettrey and Laster, 2014), and so has misogyny in the virtual world (Filipovic, 2007; Mantilla, 2013).
Racist and sexist discourse are a constant in social media, but peaks are documented after "trigger" events, such as murders with religious or political reasons (Burnap and Williams, 2015). Most social media companies are concerned about this issue and take actions against it; nonetheless, most of the efforts still need human intervention, making this task very expensive. Therefore, reducing human intervention is vital in order to have effective tools to avoid the escalation of hate speech.
HatEval (Basile et al., 2019) is a SemEval-2019 shared task aimed at the detection of hate speech towards immigrants and women in tweets. It comprises two subtasks, with datasets in English (EN) and Spanish (ES) for both of them, giving a total of four subtasks. Subtask A is the binary classification of tweets into hateful or not hateful (HS). Subtask B is a triple binary classification task where, in addition to HS, tweets are classified into aggressive or not aggressive (AG), and targets of hate speech are classified into single humans or groups of persons (TR).
In this article, we present our participation in HatEval as team Atalaya. We focused our efforts on subtask A for Spanish (ES-A) but also worked on subtask B in Spanish (ES-B) and subtask A in English (EN-A). Our systems are based on our participation in the polarity classification task of Spanish tweets TASS 2018 (Sentiment Analysis at SEPLN) (Martínez-Cámara et al., 2018; Luque and Pérez, 2018).
To represent tweets, we experimented with a mixed approach of bag-of-words, bag-of-characters and tweet embeddings, which were calculated from word vectors using different averaging schemes. We used fastText (Bojanowski et al., 2016) to get subword-aware representations specifically trained for sentiment analysis tasks.
These word representations are robust to noise since they can be computed for unseen words by using subword embeddings. Moreover, we trained them using a database of 90M tweets from various Spanish-speaking countries, giving wide domain-specific vocabulary coverage. We achieved additional robustness by preprocessing the text with several text-normalization and noise-reduction techniques.
We also experimented with ELMo (Peters et al., 2018), a deep contextualized word representation that has drawn a lot of attention in recent months. Unlike fastText, ELMo returns context-dependent embeddings from a multi-layer bidirectional-LSTM language model. These representations improved the state of the art in several NLP tasks. For the neural approach, we used bidirectional LSTMs to combine the word embeddings. We also ran experiments that mix sequential models with complementary representations such as bag-of-words.
The rest of the paper is organized as follows. Section 2 presents the primary tools we used to build our systems. Section 3 presents the configuration and development of both linear and neural models. Section 4 briefly shows our results in the competition, and Section 5 concludes the work with some observations about our experience.

Previous Work
The detection of hate speech is a sentence classification task closely related to sentiment analysis and has been studied for several social media networks (Thelwall, 2008; Pak and Paroubek, 2010; Saleem et al., 2017). Regarding the detection of hateful content, Greevy and Smeaton (2004) used bag-of-words and SVMs to detect racist content in web pages. Following a similar approach, Warner and Hirschberg (2012) used unigrams and Brown clusters with SVMs to detect anti-semitic messages on Twitter.
Waseem and Hovy (2016) annotated a corpus and used character n-grams to detect hateful comments, and Badjatiya et al. (2017) used the same dataset to train deep learning models and fine-tuned embeddings along with Gradient Boosted Trees. Zhang et al. (2018) trained a deep neural network combining CNNs with gated recurrent units (Cho et al., 2014), outperforming previous systems on several datasets. A corpus of misogynous tweets has also been collected, together with a taxonomy that distinguishes them into different categories. Its authors proposed a number of different techniques to classify them, showing that simple approaches (such as linear models with token n-grams) achieve competitive performance on small-sized datasets.

Techniques and Resources

Preprocessing
Preprocessing is crucial in NLP applications, especially when working with noisy user-generated data. Here, we followed Luque and Pérez (2018), defining two levels of preprocessing: basic and sentiment-oriented preprocessing. We used one or the other, depending on the configuration.
Basic tweet preprocessing includes tokenization, replacement of handles, URLs, and e-mails, and shortening of repeated letters.
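The basic level can be illustrated with a few regular expressions. This is only a minimal sketch: the placeholder tokens (`USER`, `URL`, `EMAIL`), the patterns, and the choice to shorten runs of three or more identical letters to two are assumptions made for illustration, not the exact rules of our pipeline.

```python
import re

# Illustrative patterns and placeholder tokens (assumptions, not the paper's rules).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
HANDLE_RE = re.compile(r"@\w+")
# Three or more repetitions of the same letter (digits and "_" are excluded).
REPEAT_RE = re.compile(r"([^\W\d_])\1{2,}")
# Very simple tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def basic_preprocess(text):
    """Replace URLs, e-mails and handles, shorten repeated letters, tokenize."""
    text = URL_RE.sub("URL", text)
    text = EMAIL_RE.sub("EMAIL", text)  # before handles, so "a@b.com" stays whole
    text = HANDLE_RE.sub("USER", text)
    text = REPEAT_RE.sub(r"\1\1", text)
    return TOKEN_RE.findall(text)


tokens = basic_preprocess("@juan mira https://t.co/x holaaaa!!!")
```

E-mails are replaced before handles so that the `@` inside an address is not mistaken for a user mention.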
Sentiment-oriented preprocessing includes lowercasing, removal of punctuation, stopwords, and numbers, lemmatization (using TreeTagger (Schmid, 1995)), and negation handling. For negation handling, we followed a simple approach: we find negation words and add the prefix 'NOT ' to the following tokens. Up to three tokens are negated, or fewer if a non-word token is found.
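The negation rule can be sketched as a single pass over the tokens. The list of negation words, the `NOT_` spelling of the prefix, and the decision to restart the scope when a second negation word appears are assumptions made for this example.

```python
# Hypothetical list of Spanish negation words, for illustration only.
NEGATION_WORDS = {"no", "nunca", "ni", "sin", "tampoco"}


def handle_negation(tokens, scope=3, prefix="NOT_"):
    """Prefix up to `scope` word tokens that follow a negation word.

    The negated span ends early when a non-word token (for example,
    punctuation) is found.
    """
    out = []
    remaining = 0  # how many upcoming tokens may still be negated
    for tok in tokens:
        if tok.lower() in NEGATION_WORDS:
            out.append(tok)
            remaining = scope
        elif remaining > 0 and tok.isalpha():
            out.append(prefix + tok)
            remaining -= 1
        else:
            out.append(tok)
            remaining = 0  # a non-word token closes the negation scope
    return out


negated = handle_negation(["no", "me", "gusta", "nada", ",", "pero", "bien"])
```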

Bags of Words and Characters
The simplest approach considered to build tweet representations was bag-of-words encoding. A bag-of-words (BoW) defines one feature for each token seen in the training data. For a particular tweet, its BoW vector contains the number of occurrences of each token in it, resulting in high-dimensional and sparse vectors. Variations of BoW include counting not only single tokens but also n-grams of tokens, binarizing counts, and limiting the number of features.
Character usage in tweets may also hold useful information for sentiment analysis. Some character n-grams (such as those capturing the presence and repetition of uppercase letters, emoticons, and exclamation marks) may indicate a strong presence of sentiment of some kind, whereas others may indicate a more formal writing style, and therefore an absence of sentiment.
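A minimal pure-Python sketch of this encoding, including the three variations mentioned above (token n-grams, binarized counts, and a limit on the number of features). The function names are hypothetical and dense lists are used only for readability; a practical implementation would use sparse vectors.

```python
from collections import Counter


def word_ngrams(tokens, max_n=2):
    """All token n-grams for n = 1..max_n, joined with spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def fit_vocabulary(tokenized_tweets, max_n=2, max_features=None):
    """Map each n-gram seen in training to a feature index.

    If max_features is given, only the most frequent n-grams are kept.
    """
    counts = Counter()
    for tokens in tokenized_tweets:
        counts.update(word_ngrams(tokens, max_n))
    kept = counts.most_common(max_features)
    return {ngram: idx for idx, (ngram, _) in enumerate(kept)}


def bow_vector(tokens, vocabulary, max_n=2, binary=False):
    """Count vector for one tweet; n-grams unseen in training are ignored."""
    vec = [0] * len(vocabulary)
    for ngram in word_ngrams(tokens, max_n):
        idx = vocabulary.get(ngram)
        if idx is not None:
            vec[idx] = 1 if binary else vec[idx] + 1
    return vec


train = [["no", "me", "gusta"], ["me", "gusta", "mucho"]]
vocab = fit_vocabulary(train, max_n=2)
vec = bow_vector(["me", "gusta", "me"], vocab, max_n=2)
```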
To capture this information, we considered a bag-of-characters (BoC) representation that encodes counts of character n-grams for some values of n. These vectors are computed from original texts of tweets, with no preprocessing at all. BoCs have the same variants and parameters as BoWs.
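As an illustration, character n-gram counts can be collected directly from the raw string. The function name and the toy input are assumptions for this sketch.

```python
from collections import Counter


def char_ngrams(text, max_n=5):
    """Counts of all character n-grams of length 1..max_n in the raw text."""
    return Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )


# No lowercasing or normalization: case and punctuation are kept as signal.
counts = char_ngrams("JAJA!!", max_n=2)
```

Because the text is not normalized, uppercase runs and repeated exclamation marks survive as distinct features.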

Word Embeddings
We used fastText, a subword-aware embeddings library (Bojanowski et al., 2016), to get context-independent word representations. Instead of using publicly available pre-trained vectors, we trained our own embeddings on a dataset of ∼90 million tweets from various Spanish-speaking countries. We prepared two versions of the data: one using only basic preprocessing, and the other using sentiment-oriented preprocessing (except for lemmatization). For these two datasets, skip-gram embeddings were trained using different parameter configurations, including the number of dimensions, the size of word and subword n-grams, and the size of the context window.

Tweet Embeddings
Linear combinations of word vectors were used to compute a representation for a single tweet. We followed two simple approaches: plain average and weighted average. In the second case, we used a scheme that resembles Smooth Inverse Frequency (SIF) (Arora et al., 2017), inspired by TF-IDF reweighting. Each word $w$ is weighted with $\frac{a}{a + p(w)}$, where $p(w)$ is the word unigram probability and $a$ is a smoothing hyper-parameter. Larger values of $a$ mean more smoothing towards plain averaging.
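The weighting scheme can be sketched as follows. This is a toy version under stated assumptions: word vectors come from a plain dictionary (so unknown words are skipped, whereas fastText can build vectors for them), the probabilities are invented, and the result is normalized by the sum of the weights, which is one possible reading of "weighted average".

```python
def weighted_average(tokens, vectors, unigram_prob, a=0.1):
    """SIF-like tweet embedding: each word w is weighted by a / (a + p(w))."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in vectors:
            continue  # toy lookup only; see the assumptions above
        weight = a / (a + unigram_prob.get(tok, 0.0))
        weight_sum += weight
        for i, value in enumerate(vectors[tok]):
            total[i] += weight * value
    if weight_sum == 0.0:
        return total  # no known words: return the zero vector
    return [x / weight_sum for x in total]


# Invented two-dimensional vectors and unigram probabilities.
toy_vectors = {"odio": [1.0, 0.0], "el": [0.0, 1.0]}
toy_probs = {"odio": 0.001, "el": 0.1}
tweet_vec = weighted_average(["odio", "el"], toy_vectors, toy_probs, a=0.1)
```

With a = 0.1, the rare word receives a weight close to 1 and the frequent word a weight of 0.5, so the rare word dominates the tweet vector. As a grows, both weights approach 1 and the result approaches the plain average.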

Context-Dependent Embeddings
After the great leap forward represented by context-independent word embeddings, a new wave of methods has emerged in recent years. Instead of having vectors trained for each word, context-dependent representations are generated for each token given a sentence. ELMo (Peters et al., 2018) is one of these context-dependent approaches and is based on a deep bidirectional language model (biLM). The architecture of the language model consists of L layers of bidirectional LSTMs, plus a context-independent token representation. Hence, for each token in a sequence, we get 2L + 1 vector representations. To obtain a final vector for each token, the authors suggest collapsing the layers into a single vector by means of a linear combination.
In this work, we used the implementation and pre-trained models from Che et al. (2018). The Spanish model was trained with L = 2 layers and 1024 dimensions, and the linear combination was done using a simple average.
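A minimal sketch of the layer-collapsing step, assuming the layer outputs for each token are already available as plain lists of equal dimension. The toy numbers and function names are invented, and no actual ELMo model is loaded here. The final tweet-level mean is included only to illustrate the un-weighted averaging used later for the linear classifiers.

```python
def average_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def collapse_layers(token_layers):
    """Collapse the layer vectors of each token with a simple average.

    token_layers[t][k] is the vector produced by layer k for token t.
    """
    return [average_vectors(layers) for layers in token_layers]


# Two tokens, three layer vectors each, two dimensions (toy values).
toy_layers = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.0, 0.0], [0.0, 3.0], [3.0, 0.0]],
]
token_vectors = collapse_layers(toy_layers)
tweet_vector = average_vectors(token_vectors)
```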

Models
In this section, we describe the models we used in the competition.

Linear Classifiers
The first set of models we trained were simple classifying models implemented with scikit-learn (Pedregosa et al., 2011).
We started from the optimal configuration from Luque and Pérez (2018), that combines bag-of-words (BoW), bag-of-characters (BoC) and tweet embeddings as follows:
• BoW: All unigrams and bigrams of words, with binarized counts and TF-IDF reweighting. For the Spanish training dataset, this encoding gives 53504 sparse features.
• BoC: All n-grams of characters for n ≤ 5, with binarized counts and TF-IDF reweighting. For the Spanish training dataset, it gives 226156 sparse features.
• Tweet embeddings: Computed from fastText sentiment-oriented word vectors of 50 dimensions. Weighted averaging was done as described in Section 2.4, with a smoothing value of a = 0.1.
Here, the only parameters specifically optimized using the HatEval development set were the n-gram ranges considered for BoW and BoC. Using this vectorial representation we trained logistic regressions and linear-kernel SVMs with different hyperparameter values. The best results are shown in the first block of Table 1, as LR$_0$ and SVM$_0$.
Next, to confirm the relevance of each of the three components, we performed ablation tests for each of them. Results are displayed as SVM$_{\text{BoW}}$, SVM$_{\text{BoC}}$ and SVM$_{\text{emb}}$ in Table 1. Drops in performance show the relevance of all components, especially BoW and BoC.
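A rough scikit-learn sketch of the sparse part of this representation feeding a linear SVM. Several things are assumptions: the toy texts and labels are invented, the tweet-embedding component is left out for brevity, the vectorizer and classifier settings other than the n-gram ranges and binarization are library defaults rather than our tuned values, and word-level preprocessing is skipped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

# BoW: word unigrams and bigrams, binarized counts with TF-IDF reweighting.
bow = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), binary=True)
# BoC: character n-grams for n <= 5 on the raw text (no lowercasing).
boc = TfidfVectorizer(analyzer="char", ngram_range=(1, 5), binary=True,
                      lowercase=False)

# Concatenate both sparse encodings and train a linear-kernel SVM on top.
model = make_pipeline(make_union(bow, boc), LinearSVC())

# Invented toy data, only to show the calls.
texts = [
    "me encanta este lugar",
    "odio los lunes",
    "que buen dia hace hoy",
    "odio el trafico de la ciudad",
]
labels = [0, 1, 0, 1]

model.fit(texts, labels)
predictions = model.predict(texts)
```

Swapping `LinearSVC` for `LogisticRegression` in the same pipeline gives the logistic regression variant.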
Next, we tried adding tweet representations computed from ELMo vectors. Full tweet vectors were obtained by simple unweighted averaging. PCA was optionally used to reduce the dimension of the final vectors. The best results were

To participate in the Spanish subtask B (ES-B) we used a very naive approach. We did not develop or tune a specific system for this subtask but instead used the same system and configuration that was found optimal for subtask A. To do this, we first mapped the triple classification problem to a 5-way classification problem over all the possible label combinations. Then, we simply trained the classifier using the Spanish subtask B training dataset.
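The mapping can be sketched as a lookup table. The enumeration below is an assumption: it supposes that TR and AG can only be positive for hateful tweets (HS = 1), which leaves exactly five combinations of (HS, TR, AG). The class indices are arbitrary.

```python
# Assumed valid (HS, TR, AG) combinations: TR and AG require HS = 1.
VALID_COMBINATIONS = [
    (0, 0, 0),
    (1, 0, 0),
    (1, 0, 1),
    (1, 1, 0),
    (1, 1, 1),
]
CLASS_OF = {combo: idx for idx, combo in enumerate(VALID_COMBINATIONS)}


def encode_labels(hs, tr, ag):
    """Map the three binary labels to a single class in 0..4."""
    return CLASS_OF[(hs, tr, ag)]


def decode_class(class_index):
    """Recover the (HS, TR, AG) triple from a predicted class."""
    return VALID_COMBINATIONS[class_index]


single_class = encode_labels(1, 0, 1)
triple = decode_class(single_class)
```

At prediction time, the class returned by the 5-way classifier is decoded back into the three binary labels required by the subtask.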

Neural Models
The second set of models we trained are neural models. We trained Recurrent Neural Networks (RNNs) using pre-trained context-dependent representations for Spanish. The first model considered was a bidirectional LSTM with a dense layer on top, consuming ELMo vectors; we call this model LSTM-ELMo. We also tried another model that adds a second input consisting of a bag-of-words, as illustrated in Figure 1. We call this model LSTM-ELMo+BoW. Using fastText embeddings (of dimension 300 and context window 5) instead of BoW was considered, as suggested by Peters et al. (2018), but discarded as it had no positive impact on performance on the development dataset.
The biLSTM layer consists of 256 units. The bag-of-words has the 3500 most frequent n-grams (with document frequency below 0.65), followed by a 512-unit dense layer. The last two dense layers have 64 neurons.
We used Keras (Chollet et al., 2015) to implement and train our models. Adam (Kingma and Ba, 2014) was the chosen optimizer, with $lr = 35 \times 10^{-5}$ and $decay = 0.01$. To regularize our models, we applied dropout with keep-prob of 0.2 on the first layer and 0.45 on the second, and we also stopped training early by monitoring performance on the development dataset. The hyperparameters were chosen by a small random search, as training ELMo is computationally expensive.

than LSTM-ELMo. This last model achieved similar results to SVM$_0$. This difference between the models was not seen in English. For the Spanish subtask B (ES-B), the same SVM$_0$ system was used, achieving an average F1 of 0.758 and an EMR score of 0.657 over the test set (fourth place in terms of EMR).

Conclusion and future work
As in our previous experience with sentiment analysis, we found that linear models can be a match for neural models. Moreover, this time our SVM ranked first in one of the subtasks.
We believe that, for this kind of challenge with small-sized datasets, preprocessing techniques, data normalization and robustness play a stronger role than model design and hyperparameter tuning. On the other hand, deep neural models are highly expressive and prone to overfitting, so they require extremely careful regularization.