MITRE at SemEval-2019 Task 5: Transfer Learning for Multilingual Hate Speech Detection

This paper describes MITRE’s participation in SemEval-2019 Task 5, HatEval: Multilingual detection of hate speech against immigrants and women in Twitter. The techniques explored range from simple bag-of-ngrams classifiers to neural architectures with varied attention mechanisms. We describe several styles of transfer learning from auxiliary tasks, including a novel method for adapting pre-trained BERT models to Twitter data. Logistic regression ties the systems together into an ensemble submitted for evaluation. The resulting system was used to produce predictions for all four HatEval subtasks, achieving the best mean rank of all teams that participated in all four conditions.


Introduction
The popularity of social media allows anyone to post their thoughts and opinions for all to see. While the vast majority of these communications are benign, there are those who express hateful or threatening messages online. The identification of hate speech (Fortuna and Nunes, 2018; Schmidt and Wiegand, 2017) on platforms like Twitter is of particular interest to law enforcement and to social media companies that wish to remove accounts with offending content from their sites. Automating the identification of hate speech will allow platforms to flag and remove content much more quickly and effectively.
In this effort we explored neural transfer learning techniques, including word embeddings and fine-tuning of models trained with diverse auxiliary tasks. We built and compared models employing soft attention over sequences and multiheaded self-attention. We also present a novel task to aid in performing additional pre-training of BERT (Devlin et al., 2018) for domain adaptation to Twitter data.

Task, Data and Evaluation
HatEval was a shared task organized within SemEval-2019 (Basile et al., 2019). The primary task was detection of hate speech in Twitter, specifically against immigrants and women. This multilingual shared task was organized into two sub-tasks, each presented in both English and Spanish, for a total of four sub-task evaluations.
Task A The first sub-task was simply to identify tweets containing hate speech against immigrants or women. The official metric used for this binary classification task was macro-averaged F1 score, in which F1 scores are calculated separately for the positive (hate speech) and negative (not hate speech) classes and then those two scores are averaged.
Task B The second sub-task involved the detection of two specific aspects of hate speech: whether it is targeted at an individual vs. a group of people, and whether it expresses aggression on the part of the author. In this annotation scheme, there is a dependency between these two categories and the hate speech label used in Task A, as tweets could only be labeled as positive for targeting or aggression if they were positive for hate speech. The official metric used for Task B was Exact Match Ratio (EMR), which is the proportion of tweets that are labeled correctly for all categories (hate speech, targeting, and aggression). Another way to think of this is as a five-class classification problem where the classes are (H=0, T=0, A=0), (H=1, T=0, A=0), (H=1, T=0, A=1), (H=1, T=1, A=0), (H=1, T=1, A=1). EMR on predicting the three classes separately is equivalent to accuracy on this five-class classification.
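The equivalence between EMR and five-class accuracy can be checked with a small sketch. The function names and the toy labels below are illustrative assumptions, not part of the task's official scorer, and the equivalence assumes that every predicted triple respects the annotation dependency (T and A are 0 whenever H is 0).

```python
# Hypothetical helpers; label triples follow the (H, T, A) scheme described above.
VALID_CLASSES = [(0, 0, 0), (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]

def exact_match_ratio(gold, pred):
    """Proportion of tweets whose (H, T, A) triple is predicted exactly."""
    matches = sum(1 for g, p in zip(gold, pred) if tuple(g) == tuple(p))
    return matches / len(gold)

def five_class_accuracy(gold, pred):
    """Accuracy after mapping each valid triple to a single class index."""
    to_class = {triple: i for i, triple in enumerate(VALID_CLASSES)}
    gold_ids = [to_class[tuple(g)] for g in gold]
    pred_ids = [to_class[tuple(p)] for p in pred]
    return sum(g == p for g, p in zip(gold_ids, pred_ids)) / len(gold_ids)

# Toy data (invented for illustration): one of four tweets misses on A.
gold = [(0, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
pred = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
emr = exact_match_ratio(gold, pred)
acc = five_class_accuracy(gold, pred)
```

Both functions score the toy predictions identically, since a triple matches exactly if and only if its class index matches.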

Dataset Characteristics
The English datasets consisted of 9000 tweets for train, 1000 for dev, and 3000 for test. The Spanish datasets were half the size of the English, with 4500 tweets for train, 500 for dev, and 1500 for test.
Cursory examination revealed drastic differences between the training and test sets, particularly in English. The pejorative term bitch appeared in 12% of the training tweets vs. 48% of the test tweets. The hashtags #BuildThatWall or #BuildTheWall appeared at rates of 6% and 23% in train and test, respectively. Likewise, #MAGA was in over 12% of the test set tweets but in under 3% of the training set messages. Thus the English test set appears to be dominated by a handful of heavily represented phenomena.
Different annotation strategies appear to have been used on the training and test sets as well. While tweets mentioning #BuildThatWall or #BuildTheWall were annotated as hate speech 98% of the time in the training set, this number is 35% on the test set. Similarly, tweets containing bitch were labeled as hate speech 78% of the time in the training set vs. 43% of the time in the test set.
The use of hashtags differs markedly between languages. Hashtags are much more frequent in the English training data than the Spanish training data, with English tweets 2.6 times more likely to contain at least one tag, and with tags occurring in English at 4.1 times the rate in Spanish. In the English training data, the most frequent ten hashtags were 23% of the overall total and tended towards American political topics. In Spanish, the top ten tags account for only 8% of the total, exhibiting a much longer and sparser tail.

System Overview
For each task, we created an ensemble of systems, each of which independently predicted the classes. The component systems are described in the following eight sections, after which we describe the procedure for building and testing the ensembles. All component systems described below treated Task B as a five-class prediction problem, and with the exception of two BERT-based systems, were trained to address Task A and Task B simultaneously.
Data and resources SemEval organizers provided training and development sets for English and Spanish. Planning to build ensembles, we shuffled and split out 10% of the training data for calibrating models in the ensembles (the calibration set from here on). Components were trained using the remaining 90% of the training sets provided, with hyperparameter search and validation using the full development sets or via cross-validation.
We did not use any additional supervised datasets.
The BiLSTM, Name Embedding, and Hashtag Prediction models incorporated pre-trained word2vec (Mikolov et al., 2013) language-specific embeddings that we trained on 1558 billion English and 444 million Spanish tweets collected from 2011 to 2018. In both cases we applied word2phrase twice to identify phrases of up to four words, and trained a skip-gram model of size 256, using a context window of 10 words and 15 negative samples per example.
For Task A, all of our component systems and ensembles included a post-processing step to select the best threshold score for classifying hate speech in order to achieve the maximum macro-averaged F1 score on the development set.
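A minimal sketch of such a threshold search follows. The function names, the candidate set (every observed score), and the toy data are assumptions made for illustration; the paper does not specify how candidate thresholds were enumerated.

```python
def macro_f1(gold, pred):
    """Average of the F1 scores for the positive (1) and negative (0) classes."""
    scores = []
    for cls in (1, 0):
        tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
        fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
        fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / 2

def best_threshold(gold, scores):
    """Try each observed score as a cut-off; keep the one with the highest macro F1."""
    best_f1, best_t = -1.0, None
    for t in sorted(set(scores)):
        pred = [1 if s >= t else 0 for s in scores]
        f1 = macro_f1(gold, pred)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_t, best_f1

# Toy development set (invented): gold labels and classifier scores.
dev_gold = [0, 0, 1, 1, 1]
dev_scores = [0.10, 0.35, 0.30, 0.60, 0.90]
threshold, f1 = best_threshold(dev_gold, dev_scores)
```

On this toy data the search settles on a cut-off of 0.60, which trades one missed positive for a clean negative class.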

BiLSTM with Attention
We trained several heavily regularized single-layer Bidirectional LSTM (Hochreiter and Schmidhuber, 1997) models to learn a tweet representation with soft attention (Bahdanau et al., 2014) over a sequence of pre-trained token embeddings. Hyperparameter experimentation with Spearmint (Snoek et al., 2012) suggested that a shallow network with attention outperformed deeper, stacked networks and networks without attention. Our attention layer learns to weight context-aware representations of each timestep of the input.
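The weighting step performed by a soft attention layer can be sketched as follows. This is a simplified dot-product variant with a single query vector standing in for the learned parameters; it is an assumption for illustration and not the exact scoring function used in our models.

```python
import math

def soft_attention(hidden_states, query):
    """Weight each timestep by softmax(h_t . query) and return the weighted sum.

    hidden_states: list of per-timestep vectors (e.g. BiLSTM outputs).
    query: a vector of the same size, standing in for learned attention weights.
    """
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return context, weights

# Toy input (invented): three timesteps with two-dimensional states.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = soft_attention(states, query=[0.5, 0.5])
```

The returned context vector is the tweet representation; timesteps that align with the query receive larger weights.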
We trained one architecture for the English tasks and two architectures for Spanish, although the second was ablated from our Task A ensemble. The models were identical in structure and differed only in hyperparameters. All models were constructed with spatial dropout over a frozen embedding layer, followed by an embedding transform, one bi-directional LSTM layer with dropout, an attention layer, and a fully-connected hidden layer with dropout.
In each of these models, the NLP representation was used as input to a small prediction network of latent predictions and residual connections described in Section 3.5.

Name embeddings
This model added a name embedding input to our BiLSTM described above, in an effort to better model the demographics of the individuals addressed within a tweet.
We trained our name vectors using the word2vec objective. Each context was made up of multiple usernames a single Twitter user had employed during a multi-year longitudinal sample of random tweets streamed from the platform. This resulted in a vocabulary of approximately two million name pieces, which includes common names as well as alternate spellings using special characters, symbols, emoji, and other text entered in the user name field.
We extracted all substrings of at least length 3 from each username mention in a tweet and included any of them that were in our name embedding vocabulary as input to our model. We applied a learned transformation to each embedding and created a weighted combination with an attention layer. This was concatenated with a hidden representation constructed with the BiLSTM architecture described in Section 3.1. This concatenation was the input to the prediction network described in Section 3.5.
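The substring extraction can be sketched as below. The mention pattern, the function names, and the toy vocabulary are illustrative assumptions; the actual tokenization of mentions may differ.

```python
import re

def username_substrings(username, min_len=3):
    """All contiguous substrings of length >= min_len."""
    n = len(username)
    return {username[i:j] for i in range(n) for j in range(i + min_len, n + 1)}

def name_pieces(tweet, vocabulary, min_len=3):
    """Collect substrings of every @mention that appear in the name vocabulary."""
    pieces = set()
    for mention in re.findall(r"@(\w+)", tweet):  # assumed mention pattern
        pieces |= username_substrings(mention, min_len) & vocabulary
    return pieces

# Toy vocabulary (invented) standing in for the name embedding vocabulary.
vocab = {"maria", "mar", "bob", "xyz"}
found = name_pieces("hi @mariag and @bob", vocab)
```

Each matched piece would then be looked up in the name embedding table before the attention layer combines them.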
The Spanish name embedding model consisted of dropout over frozen embeddings, a dense embedding transform, and an attention layer. For English, only an attention layer over the frozen embeddings was used. The hyperparameters from our best English model were used in the BiLSTM architecture for both languages.

DeepMoji
The DeepMoji model developed by Felbo et al. (2017) predicts the emoji removed from an English-language tweet text. The authors train their RNN model on 1274 million tweets for a set of 64 emojis. Using varying degrees of fine-tuning and newly initialized layers, they test their distantly supervised models on several benchmark datasets for detecting emotion, sentiment, and sarcasm. The model's best results used their chain-thaw fine-tuning method, which iteratively unfreezes and trains layers for the new objective. The authors distribute their trained model for the emoji prediction task.
We experimented with both chain-thaw training and models that were frozen until the final layer of abstraction in DeepMoji. The pre-trained model has a vocabulary that omits many of the hashtags and usernames that were important for our task. Our best model used 0.75 dropout over the output of a frozen DeepMoji model and three fully connected layers of sizes 512, 256, and 128 before the annotation constraint adapter. Chain-thaw models performed poorly and were ablated from our Task A submission. DeepMoji models are only included in our English ensembles.

Hashtag prediction network
Following Zarrella and Marsh (2016), we implemented a recurrent neural network classifier that was pre-trained via an auxiliary masked hashtag prediction task. We extracted 30 of the top hashtags found in the training data, with 15 selected from both the hate speech positive and negative classes. Then we searched for the fifteen nearest neighbors of each tag via cosine similarity in embedding space, using vectors described in Section 3. After removing duplicates, this resulted in 136 English and 132 Spanish hashtags. We downloaded up to 1,000 recent tweets containing each hashtag from Twitter's public search API, resulting in 11,539 English tweets and 12,504 Spanish tweets. Tweets were stripped of the target hashtag(s), and each corpus was divided into a training and development set using a 90/10 split.
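The hashtag expansion step can be sketched as a nearest-neighbor search followed by de-duplication. The function names and toy vectors are assumptions for illustration, as is the choice to restrict candidates to tokens beginning with '#'.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_hashtags(seeds, embeddings, k):
    """Return the seeds plus each seed's k nearest hashtag neighbors, de-duplicated."""
    expanded = set(seeds)
    for seed in seeds:
        # Assumption: only hashtag tokens are eligible neighbors.
        candidates = [t for t in embeddings if t.startswith("#") and t != seed]
        candidates.sort(key=lambda t: cosine(embeddings[seed], embeddings[t]), reverse=True)
        expanded.update(candidates[:k])
    return expanded

# Toy embedding table (invented two-dimensional vectors).
toy = {"#a": [1.0, 0.0], "#b": [0.9, 0.1], "#c": [0.0, 1.0], "#d": [0.1, 0.9], "word": [1.0, 0.0]}
tags = expand_hashtags(["#a", "#c"], toy, k=1)
```

Using a set for the result handles the duplicate removal, since neighbors shared by several seeds are stored only once.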
The sequence of vector representations of the tokens in each tweet served as the input to a neural network with 128 LSTM units followed by a dense softmax layer over the possible candidate hashtags. Both the word embeddings and the recurrent layer were tuned. These models correctly predicted development set hashtags with 50.3% accuracy on the English data and 56.6% accuracy on the Spanish data.
The trained weights were extracted from this network and used to initialize the five-way hate speech classifier for Task B, described in Section 2, which additionally saw as input the one-hot representations of the 600 most frequent unigrams and 300 most frequent bigrams in the training data, each followed by a fully-connected dense layer. The size of each fully connected layer and amount of dropout were experimentally determined using Spearmint (Snoek et al., 2012) to maximize performance on the competition metrics on our development set.

Annotation constraint adapter
Both Task A and Task B had annotation constraints based on latent variables. In Task A, hate speech (H) was not marked as true unless the tweet was directed at women (W) or immigrants (I). In Task B, aggression (A) and individual targeting (T) were not marked as true unless hate speech directed at women or immigrants was present. Even though W and I are not directly represented in our datasets, we believe they are latent variables that can be discovered in the NLP representation. Figure 1 shows an adapter we placed at the end of several systems to encourage the network to learn these constraints. While it doesn't enforce the constraints, it sets up a principled graphical model that encourages the network to learn them. Of course, nothing prevents the network from learning to model other things with this topology. Fair comparisons to stacked dense layers with the same number of parameters showed that the network with this topology performed better. The upside to the design of a network like this is that the removal of the H switch might yield more general-purpose A and T classifiers.

Pre-training BERT with Twitter data
Pre-trained language models such as BERT (Devlin et al., 2018) have been demonstrated to achieve state-of-the-art performance on a range of language understanding tasks. BERT uses a transformer encoder model (Vaswani et al., 2017) and pre-trains the model using two complementary objectives: masked language model, and next sentence prediction. The pre-trained model may then be fine-tuned on labeled data (in this case the HatEval dataset) to perform a downstream task.
For English, we used the BERT-Large model, which has 24 layers, 1024 hidden layer size, and 16 self-attention heads. For Spanish, we used the smaller multilingual BERT, with 12 layers, 768 hidden layer size, and 12 self-attention heads. The English BERT is trained on Wikipedia and BooksCorpus (Zhu et al., 2015), while the multilingual model is trained on Wikipedia from multiple languages. As the language in these sources is likely to be quite different from the language commonly used on Twitter, we elected to perform additional pre-training using a corpus of tweets collected during the same time period as the HatEval training dataset (October 2017 to September 2018). All of the pre-training experiments described below started from the TensorFlow model checkpoints downloaded from (Google Research, 2018).
Since the tweets in our collection are not sequential, they cannot be used for the next sentence prediction that BERT uses to learn sentence relationships. We therefore began by running 20k steps of additional pre-training using only the masked language model task. Next, we hypothesized that replacing the next sentence prediction task with a task involving predicting some attribute of the author of the tweet would provide the model with latent information about the nature of tweets that would allow it to discriminate between different classes of tweets more accurately. We performed 20k additional pre-training steps with the user description from the author's Twitter profile standing in for the second sequence in the sentence prediction task. In other words, we trained the network to determine whether a given pair of (tweet text, author description text) were sampled from the same tweet. Finally, we pre-trained a BERT model with the screen name of the Twitter user as the secondary prediction task.
Table 1 shows the validation scores for our five-class model under our different pre-training schemes: no additional pre-training, pre-training on masked LM only, pre-training on MLM + Twitter user descriptions, and pre-training on MLM + Twitter user screen names. Additional pre-training resulted in increased validation scores on all four tasks, and incorporating user descriptions in place of the next sentence prediction task further resulted in increased scores for both Spanish tasks.
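The construction of training pairs for the author description task can be sketched as follows. The function name, the one-positive-one-negative sampling scheme, and the toy records are assumptions for illustration; the sketch does not reproduce the BERT data pipeline itself.

```python
import random

def make_pair_examples(records, seed=0):
    """Build (tweet, description, is_match) triples.

    records: list of (tweet text, author description) tuples.
    For every record, emit the true pair (label 1) and one pair whose
    description is drawn from a different record (label 0).
    """
    rng = random.Random(seed)
    examples = []
    for i, (tweet, description) in enumerate(records):
        examples.append((tweet, description, 1))
        # Negative example: a description sampled from some other record.
        j = rng.choice([k for k in range(len(records)) if k != i])
        examples.append((tweet, records[j][1], 0))
    return examples

# Toy records (invented) in place of the collected tweet corpus.
records = [("tweet one", "bio one"), ("tweet two", "bio two"), ("tweet three", "bio three")]
pairs = make_pair_examples(records)
```

The binary label takes the place of BERT's is-next label, so the rest of the pre-training objective can remain unchanged.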

Maximizing ensemble diversity
During development, we noticed that some of the high-capacity neural network models had significant variance in prediction accuracy depending on the subset of the training data used, the hyperparameter settings, or simply differences in parameter initialization. Such variance would suggest using model bagging (Breiman, 1996) or another form of variance reduction. However, given the relatively long training times for some of the neural network models, especially those based on BERT, using ensemble methods such as bagging directly proved too cumbersome as part of the model development workflow. Instead, we employed a form of negative correlation learning (Liu and Yao, 1999) to train a small ensemble of neural network classifiers within a single architecture. A term was added to the fine-tuning cross-entropy loss function which encouraged diversity among all pairs of classifiers, following Opitz et al. (2016).
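One possible form of such a loss is sketched below: the task cross-entropy averaged over classifier heads, minus a weighted average of the pairwise cross-entropy between heads. The exact term, the weighting, and the names used here are assumptions for illustration and should not be read as the precise formulation used in our systems.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against a target distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))

def ensemble_loss(gold_one_hot, head_probs, diversity_weight=0.1):
    """Mean task loss over heads minus a weighted pairwise disagreement term.

    Subtracting the pairwise cross-entropy rewards heads whose output
    distributions differ from one another (hypothetical formulation).
    """
    task = sum(cross_entropy(gold_one_hot, p) for p in head_probs) / len(head_probs)
    pairs = [(a, b) for i, a in enumerate(head_probs)
             for j, b in enumerate(head_probs) if i != j]
    diversity = sum(cross_entropy(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return task - diversity_weight * diversity

# Toy outputs (invented): both ensembles assign 0.6 to the gold class.
gold = [1.0, 0.0, 0.0]
heads_same = [[0.6, 0.2, 0.2], [0.6, 0.2, 0.2]]
heads_diff = [[0.6, 0.3, 0.1], [0.6, 0.1, 0.3]]
```

In this toy case the two ensembles have equal task loss, and the one whose heads disagree on the non-gold classes receives the lower total loss.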

Logistic Regression
Logistic regression (LR) systems were developed as a baseline against which the neural approach would be compared. Had annotators used very simple features such as words or phrases to make decisions, these would have been found in the course of LR training. Some of the systems were good enough to include in the final ensembles. The vocabulary of the LR system was limited to the training set. Many feature sets were explored during model search. The best models preferred binary feature sets (presence or absence of each feature) rather than counts or term frequencies.
Word n-grams of length 1-3 and character n-grams to length 8 were all considered, along with skip bigrams. The specifics of the best resulting feature sets are in Table 3. Table 2 shows the most important features from an English Task A LR system, sorted by feature influence, the product of feature function standard deviation and model weight. The second column is model weight, with negative weights contributing to a (H=1) decision.
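The feature influence score can be sketched directly from its definition. The function name, the use of the population standard deviation, the ranking by absolute value, and the toy data are assumptions for illustration.

```python
import statistics

def feature_influence(feature_columns, weights):
    """Rank features by |std(feature) * weight|, largest first.

    feature_columns: {name: [feature value per training example]}
    weights: {name: model weight}
    Returns a list of (name, weight, influence) tuples.
    """
    rows = []
    for name, values in feature_columns.items():
        influence = statistics.pstdev(values) * weights[name]
        rows.append((name, weights[name], influence))
    return sorted(rows, key=lambda r: abs(r[2]), reverse=True)

# Toy binary features (invented): one rare, one present in half the examples.
columns = {
    "rare": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "common": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
}
ranked = feature_influence(columns, {"rare": -1.5, "common": -1.0})
```

Here the common feature ranks first despite its smaller weight, because it varies more across the training examples.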
In all cases, a bias term was added and Liblinear (Fan et al., 2008) was used to compute the model. L2 regularization was used to encourage generalization. Cross-validation was used to pick the regularization parameters.

Ensemble
Many systems were created, and final ensembles were constructed by incremental ablations. An initial all-in ensemble was created and tested, then it was tested with each component removed. This process was iterated on the best performing ablated sets until gains were no longer observed. Approximately two thousand total ensembles were created through the ablative search. Two systems were ablated in Task A EN, three in Task A ES, one in both Task B conditions. Those systems are not described in this paper.
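A greedy version of this ablative search can be sketched as below. It follows only the single best ablated set at each step, which is a simplification of the search described above; the function name and the toy scoring function are illustrative assumptions.

```python
def ablation_search(components, evaluate):
    """Greedy backward elimination over ensemble components.

    Repeatedly drop the single component whose removal yields the highest
    score, stopping as soon as no removal improves on the current score.
    evaluate: callable mapping a list of component names to a score.
    """
    current = list(components)
    best_score = evaluate(current)
    while len(current) > 1:
        trials = [(evaluate([c for c in current if c != drop]), drop) for drop in current]
        score, drop = max(trials)
        if score <= best_score:
            break
        best_score = score
        current.remove(drop)
    return current, best_score

def toy_score(subset):
    """Invented scoring function: two helpful components and one harmful one."""
    return 0.7 + 0.05 * ("bert" in subset) + 0.02 * ("lr" in subset) - 0.03 * ("noisy" in subset)

kept, score = ablation_search(["bert", "lr", "noisy"], toy_score)
```

In practice `evaluate` would refit the ensemble on the calibration set for each candidate subset, which is what makes the search expensive.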
Ensembles were constructed using logistic regression on either the classifier outputs or the classifier outputs and final probabilities from the model. One oddity to note is that the ensembles using the probabilities performed better for Task A and the ensembles ignoring the probabilities performed better in Task B. Table 3 shows ensemble compositions for each of the four tested conditions. The first column, labeled influence, indicates the influence that the particular component has on the ensemble. It is the number of cases in which that component's contribution changes the outcome of the ensemble. It is calculated by zeroing out all LR weights for that particular component and noting the difference. In English, the BERT models had the most influence, while in Spanish, the influence was more evenly distributed across the components. Table 3 shows performance of our component models and ensembles. The calibration set factored column shows the performance of the component on our calibration data. This is the macro-averaged F1 score for Task A and Exact Match Ratio for Task B. The calibration set ablated column shows the performance of the ensemble when that component is removed and the ensemble parameters are re-optimized. Finally, there are the scores we calculated after the evaluation period for each of our components using the released reference sets.
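The influence count can be sketched for a binary logistic regression ensemble as follows. The function names, the 0.5 decision threshold, and the toy weights are assumptions for illustration; the five-class Task B ensembles would need a multiclass analogue.

```python
import math

def predict(weights, bias, features):
    """Binary logistic regression decision with a 0.5 probability cut-off."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0

def component_influence(weights, bias, examples, component_indices):
    """Count examples whose ensemble decision changes when the LR weights
    belonging to one component are set to zero."""
    zeroed = [0.0 if i in component_indices else w for i, w in enumerate(weights)]
    return sum(1 for x in examples
               if predict(weights, bias, x) != predict(zeroed, bias, x))

# Toy ensemble (invented): two components, one input feature each.
weights, bias = [2.0, 0.5], -1.0
examples = [[1, 0], [0, 1], [1, 1], [0, 0]]
influence_first = component_influence(weights, bias, examples, {0})
influence_second = component_influence(weights, bias, examples, {1})
```

With these toy weights the first component changes two of the four decisions when removed, while the second changes none.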

Results
The official scores achieved by our ensembles

Conclusion
An ensemble of models was used to classify tweets according to whether they contained hate speech, aggression, and targeting of individuals. The novel contributions include using name embeddings, substituting Twitter author profile prediction for next sentence prediction in BERT pre-training, and augmenting BERT's fine-tuning loss function with a diversity term to create an ensemble. There is a discrepancy between the official test set results and our held-out calibration set, particularly in the English subtasks, which we attribute to dataset divergences like those called out in Section 2.