Nikolov-Radivchev at SemEval-2019 Task 6: Offensive Tweet Classification with BERT and Ensembles

This paper examines different approaches and models for offensive tweet classification that were used as part of the OffensEval 2019 competition. It reviews tweet preprocessing, techniques for overcoming unbalanced class distribution in the provided test data, and a comparison of the machine learning models we tried.


Introduction
The purpose of this paper is to explore different approaches towards classifying tweets based on whether they are offensive or not, whether offensive tweets are targeted, and what the target of an offensive tweet is: an individual, a group, or other. Those are the terms of the OffensEval 2019 competition in which we participated. Each of the described activities constituted a separate subtask of the competition. A maximum of three submissions was allowed per subtask, which required careful preliminary analysis of the model results during the training phase. A training set of over 13,000 tweets was provided, containing labels for all three subtasks. Each of the subtasks was scored using the macro F1 score.
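Since every subtask is scored with macro F1, the following minimal sketch shows how that metric is computed: the F1 score of each class is calculated separately and the results are averaged without weighting by class size. The function name and the example labels are illustrative assumptions and are not taken from the competition tooling.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold labels and predictions for four tweets.
gold = ["OFF", "NOT", "NOT", "NOT"]
pred = ["OFF", "NOT", "NOT", "OFF"]
score = macro_f1(gold, pred)
```

Because each class contributes equally to the average, errors on a rare class lower the score as much as errors on a frequent one, which is why class imbalance matters for this metric.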

Related Work
One of the most effective strategies for tackling this problem is to use computational methods to identify offense, aggression, and hate speech in user-generated content (e.g. posts, comments, microblogs, etc.). This topic has attracted significant attention recently, as evidenced by publications from the last two years.
Survey papers describing key areas that have been explored for this task include (Schmidt and Wiegand, 2017), (Fortuna and Nunes, 2018) and (Malmasi and Zampieri, 2017). The dataset for this competition is explained in (Zampieri et al., 2019a) and different approaches to the same problem are reported in (Zampieri et al., 2019b).
In order to correctly classify abusive language it is important to analyze its types. A proposed typology of abusive language sub-tasks is presented in (Waseem et al., 2017), and (ElSherief et al., 2018) examines the target of the speech: either directed towards a specific person or entity, or generalized towards a group of people sharing a common protected characteristic. (Fišer et al., 2017) proposes a legal framework, dataset and annotation schema of socially unacceptable discourse practices on social networking platforms in Slovenia. Finally, a recent discussion on identifying profanity vs. hate speech is presented in (Malmasi and Zampieri, 2018). This work highlighted the challenges of distinguishing between profanity and threatening language which may not actually contain profane language.
Approaches to detecting hate speech on Twitter using convolutional neural networks and a convolution-GRU based deep neural network are discussed in (Gambäck and Sikdar, 2017) and (Zhang et al., 2018) respectively.
Additional related work is presented in workshops such as TA-COS, Abusive Language Online, and TRAC and related shared tasks such as GermEval (Wiegand et al., 2018) and TRAC (Kumar et al., 2018).

Methodology and Data
The data was split into a training and a validation set in a ratio of 10:1. All tasks had similar preprocessing, and multiple models were trained on the training set. For each subtask, the three models that performed best on the validation set were submitted.
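A 10:1 split can be sketched as follows. The shuffling, the fixed seed, and the function name are assumptions made for illustration; the text does not say whether the split was random or stratified.

```python
import random


def split_train_validation(examples, ratio=10, seed=0):
    """Shuffle and split so that len(train) : len(validation) is about ratio : 1."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = len(shuffled) // (ratio + 1)
    # The first n_val shuffled items form the validation set, the rest the training set.
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder data: 1100 integer ids standing in for labelled tweets.
train, validation = split_train_validation(range(1100))
```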

Preprocessing
We started our tweet preprocessing by removing most punctuation marks, which carry little useful information for text classification. The symbols '@' and '#' were excluded from the list due to their specific semantics in tweets. Afterwards the tweets were subjected to tokenization and lowercasing.
All tokens beginning with a hashtag were split into the separate words comprising the token, provided that each word begins with an uppercase letter. For example, the token #HelloThere is split into the two tokens hello and there.
Afterwards we proceeded with removing a variety of stop words. When training models for the second and third subtasks, we excluded personal and possessive pronouns from the list of stop words, as they can contain valuable information for classifying a tweet as targeted or not, or for identifying the target group of a targeted tweet. We also attempted lemmatization and spell correction, but the results were slightly worse than or on par with the ones achieved without these two techniques.
Pre-trained Twitter word vectors from the GloVe project (Pennington et al., 2014) were used to encode words in a vector space. Four vector dimensions were available: 25, 50, 100, and 200. Although results were slightly better with higher-dimensional vectors, 200-dimensional vectors offered no significant advantage over 100-dimensional ones and were more computationally expensive, which led us to use 100-dimensional vectors for each subtask.
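The punctuation removal and hashtag splitting described above might look like the sketch below. It is an illustration under stated assumptions, not the authors' code: whitespace tokenization, the use of `string.punctuation` as the punctuation list, and the function names are all assumed. The hashtag is split before lowercasing so that the capital letters are still available as word boundaries.

```python
import re
import string

# Punctuation to strip; '@' and '#' are kept because they carry meaning in tweets.
PUNCT_TO_REMOVE = "".join(c for c in string.punctuation if c not in "@#")
PUNCT_TABLE = str.maketrans("", "", PUNCT_TO_REMOVE)


def split_hashtag(token):
    """Split '#HelloThere' into ['hello', 'there']; lowercase any other token."""
    if not token.startswith("#"):
        return [token.lower()]
    body = token[1:]
    parts = re.findall(r"[A-Z][a-z0-9]+", body)
    # Split only when the capitalized words account for the whole hashtag.
    if parts and "".join(parts) == body:
        return [part.lower() for part in parts]
    return [token.lower()]


def preprocess(tweet):
    """Remove punctuation, tokenize on whitespace, split hashtags, lowercase."""
    tokens = tweet.translate(PUNCT_TABLE).split()
    return [piece for token in tokens for piece in split_hashtag(token)]


tokens = preprocess("@USER Wow, #HelloThere!")
```

Hashtags that do not follow the capitalized-word pattern, such as `#hello` or an all-caps acronym, are left as a single lowercased token in this sketch.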
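The subtask-dependent stop-word handling can be sketched as below. The word lists are small illustrative stand-ins, since the stop-word list actually used is not given, and the function names are assumptions.

```python
# Illustrative stop words only; the real list is not specified in the text.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and",
              "you", "your", "he", "she", "they", "their", "his", "her"}
# Personal and possessive pronouns that are kept for subtasks B and C.
PRONOUNS = {"you", "your", "he", "she", "they", "their", "his", "her"}


def stop_words_for(subtask):
    """Return the stop-word set for subtask 'A', 'B' or 'C'."""
    if subtask in ("B", "C"):
        return STOP_WORDS - PRONOUNS
    return STOP_WORDS


def remove_stop_words(tokens, subtask):
    blocked = stop_words_for(subtask)
    return [token for token in tokens if token not in blocked]


example = ["you", "are", "the", "worst"]
for_subtask_a = remove_stop_words(example, "A")
for_subtask_b = remove_stop_words(example, "B")
```

In this example the pronoun "you" survives only for subtask B, where it can signal that the tweet is targeted at someone.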
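As a rough illustration of how such pre-trained vectors are turned into an embedding matrix, the sketch below parses the plain-text GloVe format, in which each line holds a token followed by its vector components. The three-dimensional sample data, the function names, and the choice of a zero vector for out-of-vocabulary words are assumptions; the text does not describe how unknown words were handled.

```python
import io


def load_glove(lines):
    """Parse GloVe text format: a token followed by its vector components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(vocab, vectors, dim):
    """Row i holds the vector of vocab[i]; unknown words get a zero vector."""
    return [vectors.get(word, [0.0] * dim) for word in vocab]


# Tiny 3-dimensional stand-in for a real 100-dimensional GloVe file.
sample = io.StringIO("hello 0.1 0.2 0.3\nthere -0.5 0.0 0.25\n")
vectors = load_glove(sample)
matrix = build_embedding_matrix(["hello", "unknownword", "there"], vectors, dim=3)
```

A matrix built this way would typically be used to initialize the embedding layer of a neural model.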

Models
We trained a large variety of models and combined the best of them in ensembles. For all models the embedding layer was frozen, because that proved less prone to overfitting.
• Standard Naive Bayes and Support Vector Machine (SVM) from the scikit-learn library in Python.
• Convolutional Neural Network (CNN) with GlobalMaxPooling and hidden dense layer on top.
• FastText models with n-grams of size 2.
• Recurrent Neural Network (RNN) with GRU units and attention layer and hidden dense layer on top.
• Soft Voting Classifier (SVC): averages the predictions of the single models.
• Logistic Regression: meta model trained on half of the validation set with predictions from the single classifiers as features.
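The two ensembling strategies in the list above can be sketched as follows. Soft voting averages the class probabilities of the single models, while the meta model receives those probabilities as one feature row per example. The probabilities, label order, and function names are hypothetical and serve only as an illustration.

```python
def soft_vote(model_probs):
    """Average the class-probability vectors of several models for one example."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]


def predict(model_probs, labels):
    """Return the label with the highest averaged probability."""
    averaged = soft_vote(model_probs)
    best = max(range(len(labels)), key=lambda c: averaged[c])
    return labels[best]


def stacking_features(model_probs):
    """Flatten all model outputs into one feature row for a meta model."""
    return [p for probs in model_probs for p in probs]


# Hypothetical probabilities for one tweet from three models, ordered (NOT, OFF).
probs = [[0.6, 0.4], [0.3, 0.7], [0.4, 0.6]]
voted_label = predict(probs, ["NOT", "OFF"])
feature_row = stacking_features(probs)
```

Rows like `feature_row`, collected over the held-out half of the validation set, would then be the training input for a logistic regression meta model such as scikit-learn's `LogisticRegression`.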

Class imbalance
One of the challenges of the competition was the class imbalance in the second and third subtasks. We experimented with different techniques for overcoming this challenge:
• Oversampling: duplicating some of the examples from the poorly represented classes.
• Class weights: assigning lower weights to examples from classes which are better represented and higher weights to examples from classes with a lower overall count.
• Modification of the thresholds used for classifying an example. For example, for a standard binary classification a threshold of 0.5 is applied to the predicted probability in order to distinguish between the two classes. We attempted to lower this threshold to different levels.
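The first two techniques can be sketched as follows. The inverse-frequency weighting formula, oversampling up to the size of the largest class, the class counts, and the function names are assumptions chosen for illustration; the text does not state the exact weighting scheme or how many examples were duplicated.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def oversample(examples, labels, seed=0):
    """Duplicate random minority examples until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(examples), list(labels)
    for cls, count in counts.items():
        pool = [x for x, y in zip(examples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_x.extend(extra)
        out_y.extend([cls] * len(extra))
    return out_x, out_y


# Hypothetical label distribution for ten targeted tweets.
labels = ["IND"] * 6 + ["GRP"] * 3 + ["OTH"] * 1
weights = balanced_class_weights(labels)
balanced_x, balanced_y = oversample(list(range(10)), labels)
```

With these counts the rarest class, OTH, receives the largest weight, so a misclassified OTH example costs more during training than a misclassified IND example.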
For all models apart from BERT, the class weight option was chosen. Only for BERT on subtask C were the thresholds changed instead. For classes OTH and GRP we used thresholds of 0.2 and 0.3 respectively, and if either was exceeded we directly assigned that class. If both were exceeded, we assigned OTH. The threshold values were derived via cross-validation.
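The threshold rule for subtask C can be written as a small decision function. The thresholds of 0.2 for OTH and 0.3 for GRP and the preference for OTH come from the text; treating "exceeded" as a strict comparison and falling back to the most probable class when neither threshold is exceeded are assumptions, as is the function name.

```python
THRESHOLDS = {"OTH": 0.2, "GRP": 0.3}  # values reported in the text


def classify_target(probs):
    """probs maps 'IND', 'GRP' and 'OTH' to predicted probabilities."""
    if probs["OTH"] > THRESHOLDS["OTH"]:
        # Also covers the case where both thresholds are exceeded.
        return "OTH"
    if probs["GRP"] > THRESHOLDS["GRP"]:
        return "GRP"
    # Assumption: otherwise fall back to the most probable class.
    return max(probs, key=probs.get)


# Hypothetical prediction in which IND is most probable but OTH exceeds 0.2.
label = classify_target({"IND": 0.55, "GRP": 0.20, "OTH": 0.25})
```

The lowered thresholds let a rare class win even when it is not the most probable one, trading some precision on IND for better recall on GRP and OTH.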

Results
The results from the test sets for each subtask are displayed below. We have also provided the results from our validation sets, which were the basis upon which we decided which models' predictions to submit.
The individual model with the best performance on subtask A was BERT-Large, Uncased, with a macro F1 score of 0.781 on the validation set, and it was selected as one of the models for submission. The other two submitted models were the soft voting classifier with a score of 0.788 and the logistic regression model with a score of 0.800. The scores of the other trained models are displayed below. The ensemble models proved to have overfit on the training data, and out of the models we submitted BERT had the highest score, ranking second overall amongst all participants.

In subtask B the highest scoring model on the validation set was the soft voting classifier with a score of 0.64, closely followed by the RNN and CNN with 0.63. BERT-Base, Uncased performed surprisingly poorly and achieved a score of 0.59.
The soft voting classifier scored the highest on the test set and ranked 16th overall.

Conclusion
Google's BERT model proved to be a powerful tool for text classification. Not only did it outperform common models on the validation set, but based on the results from the test set it did so without overfitting on the data.

Table 1: Results on the validation set for Sub-task A.

Table 2: Results on the test set for Sub-task A.

Table 3: Results on the validation set for Sub-task B.

Table 4: Results on the test set for Sub-task B.

Table 5: Results on the validation set for Sub-task C.

Table 6: Results on the test set for Sub-task C.