Zeyad at SemEval-2019 Task 6: That’s Offensive! An All-Out Search For An Ensemble To Identify And Categorize Offense in Tweets.

This paper describes a classification system built for SemEval-2019 Task 6: OffensEval. The system classifies a tweet as either offensive or not offensive (Sub-task A) and further categorizes offensive tweets (Sub-tasks B and C). The system consists of two phases: a brute-force grid search to find the best learners among a given set, and an ensemble of a subset of these best learners. The system achieved an F1-score of 0.728 in Sub-task A, an F1-score of 0.616 in Sub-task B and an F1-score of 0.509 in Sub-task C.


Introduction
In OffensEval, offensive content is broken down into three sub-tasks taking the type and target of offenses into account.

Sub-task A - Offensive language identification. This sub-task concerns the identification of offensive posts and posts containing any form of (untargeted) profanity. A tweet is classified into one of two categories: Offensive - the post contains offense or profanity; Not Offensive - the post does not contain any form of offense or profanity.

Sub-task B - Automatic categorization of offense types. This sub-task concerns categorizing offenses. Tweets are labeled with one of the following categories: Targeted Insult - a post containing an insult or a threat to an individual, a group, or others; Untargeted - a post containing non-targeted profanity and swearing. Posts containing general profanity are not targeted, but they contain non-acceptable language; insults and threats, on the other hand, are targeted at an individual or group.

Sub-task C - Offense target identification. Finally, in Sub-task C we are interested in the target of offenses. Only posts which are either insults or threats are included in this sub-task. The three categories are the following: Individual - the target of the offensive post is an individual: a famous person, a named individual or an unnamed person interacting in the conversation; Group - the target of the offensive post is a group of people considered as a unity due to the same ethnicity, gender or sexual orientation, political affiliation, religious belief, or something else; Other - the target of the offensive post does not belong to either of the previous two categories (e.g. an organization, a situation, an event, or an issue) (Zampieri et al., 2019b).

To tackle such complicated tasks, our approach is an exhaustive one.
We try combinations of many techniques in pre-processing, feature extraction and classification while tuning their hyper-parameters to find the models with the leading F1-scores. With this information, an ensemble of the top three models is formed to get the optimum result. Alongside this approach, we try a deep-learning method, a simple 1D CNN consisting of 3 convolutional layers and a softmax layer, to compare results.

Related Work
Offensive language on social media hardly remains unnoticed. Hateful content varies from hate speech to group-based racism and can target anyone irrespective of their status, identity, location and so forth. Even when it does not materialize into a hate-motivated crime, the damage is done: victims are labeled, marginalized and exposed to negative stereotyping. The overall consequence of online hate can be the dehumanization of individuals or groups of individuals. The need for proper strategies to tackle hate speech on social media is unquestionable. The core focus of that line of work is not to find a solution to the challenge, but rather to identify the central problems that have contributed to the existing reality; to unravel the contributing factors, a holistic analysis of both international human rights principles regarding hate speech and the practical application of those standards is necessary (Schofield and Davidson, 2017). There have been many studies and publications on the topic of offensive language and hate speech over the last few years, including (Malmasi and Zampieri, 2017), (ElSherief et al., 2018), (Gambäck and Sikdar, 2017) and (Zhang et al., 2018). There have also been challenges on how to distinguish profanity from hate speech, presented by (Malmasi and Zampieri, 2018).

Methodology and Data
The dataset used in this work is the one provided for SemEval-2019 Task 6. It was collected from Twitter by searching for offensive terms that could be present in a tweet, and consists of 14,100 tweets in total. It was annotated using crowdsourcing: the gold labels were assigned taking the agreement of three annotators into consideration, and no correction was carried out on the crowdsourced annotations. The dataset was presented in two phases: training data, already-labeled tweets used to train the classifiers, each provided with a binary classification label and an index; and testing data, unlabeled tweets to test the classifiers against (Zampieri et al., 2019a).
The system is a combination of three essential layers. First, pre-processing, a necessary step in NLP since textual data is most likely not clean; unclean data affects later stages and produces an incoherent model. Second, feature extraction, or vectorization, which translates each word into a number or a series of numbers with different weights to represent it. Finally, classification: the features extracted in the previous step are fed into a learner, and a model is created that can classify tweets.
For our approach we implemented a heap of pre-processors, vectorizers and classifiers and, with the help of a brute-force search, ranked all the resulting models according to their F1-scores. All implemented techniques are listed in Table 1.

Pre-processing
A tweet contains a lot of unwanted data that takes extra computational power to process and decreases the accuracy of the model. Noise removal and normalization techniques must therefore be applied to the corpus in order to generate more consistent models. Stopword removal is a noise removal method that filters out words that have no significance in the context of the sentence; without them, the semantics of the tweet are not affected. Lemmatization is the process of reducing a word to its linguistic root: words are first part-of-speech tagged, then converted to their roots. Stemming is the process of stripping a word of its prefixes and suffixes using the Porter stemmer algorithm (Porter, 1980).
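The pre-processing layer can be sketched as follows. This is an illustrative toy version, not the actual implementation: the stopword list is a tiny subset, and a naive suffix stripper stands in for the full Porter algorithm; the helper names are hypothetical.

```python
# Toy sketch of the pre-processing layer (hypothetical helper names).
STOPWORDS = {"the", "a", "an", "is", "are", "to", "and", "of", "in"}  # toy subset

def remove_stopwords(tokens):
    # Drop words that carry little significance for classification.
    return [t for t in tokens if t.lower() not in STOPWORDS]

def naive_stem(token):
    # Toy stand-in for the Porter stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "ly", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(tweet):
    tokens = tweet.lower().split()
    return [naive_stem(t) for t in remove_stopwords(tokens)]

print(preprocess("The players are shouting loudly in the stadium"))
# → ['player', 'shout', 'loud', 'stadium']
```

In the real system a full stopword corpus and the original Porter rules would replace these toy stand-ins.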

Feature Extraction
Now that we have clean, almost noise-free textual data, we cannot simply feed a classification model a bunch of words; most models only work with numerical data. Here we convert words to numerical features using one of the methods mentioned below to create our classification-ready data. We use three word embedding models of embedding dimension 100 (which gave adequate results after experimenting with other dimensions) alongside the standard TF-IDF/count models.
• A Word2Vec model trained on our dataset (Mikolov et al., 2013).
• A fastText model trained on our dataset (Joulin et al., 2016).
• A pre-trained GloVe model trained on 2 billion tweets (27 billion tokens, 1.2 million vocabulary) (Pennington et al., 2014).
All three embedding models are zero-padded to the maximum tweet length present in the dataset to resolve the uneven dimensionality issue. A list of all techniques is initialized for later use in the search for the best model.
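The zero-padding step can be sketched as below. The embedding table is a random toy stand-in for the trained Word2Vec/fastText/GloVe vectors, and the maximum length is illustrative; each tweet becomes a fixed-shape (max_len, dim) matrix, with shorter tweets padded by zero vectors.

```python
# Sketch of zero-padding word embeddings to a fixed shape (toy values).
import numpy as np

DIM = 100       # embedding dimension used in the paper
MAX_LEN = 5     # toy maximum tweet length; the real value is the corpus maximum

rng = np.random.default_rng(0)
vocab = {"this", "tweet", "is", "offensive", "not"}
embeddings = {w: rng.standard_normal(DIM) for w in vocab}  # stand-in for a trained model

def embed_and_pad(tokens):
    # Look up each in-vocabulary token, truncate to MAX_LEN, pad with zeros.
    vecs = [embeddings[t] for t in tokens if t in embeddings][:MAX_LEN]
    padded = np.zeros((MAX_LEN, DIM))
    if vecs:
        padded[: len(vecs)] = np.stack(vecs)
    return padded

matrix = embed_and_pad(["this", "is", "offensive"])
print(matrix.shape)                   # → (5, 100)
print(np.count_nonzero(matrix[3:]))   # padded rows are all zero → 0
```

This gives every tweet the same dimensionality regardless of its length, which is what the classifiers downstream require.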

Classification and Tuning
This is where all the previous work comes together for the final phase of the system. Seven models were chosen and tuned using scikit-learn's (Pedregosa et al., 2011) GridSearchCV, which performs a cross-validated search over a list of hyper-parameters for a given model. The parameter grids that were tested are available in Table 2.
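The tuning step looks roughly like this. The data and the parameter grid here are toy placeholders (the actual grids are those in Table 2), but the GridSearchCV call mirrors the setup described above: 3-fold cross validation, ranked by macro F1.

```python
# Minimal sketch of hyper-parameter tuning with scikit-learn's GridSearchCV.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy binary-classification data standing in for the vectorized tweets.
X, y = make_classification(n_samples=120, n_features=20, random_state=42)

param_grid = {"C": [0.1, 1.0, 10.0], "solver": ["lbfgs", "sag"]}  # toy grid
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=3,                # 3-fold cross validation, as in the paper
    scoring="f1_macro",  # models are ranked by F1-score
)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_params_` and `best_score_` then feed into the ranking of models described in the next section.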

A list of classifiers and their parameter grids is initialized to tune them with 3-fold cross validation.

All-Out Search
This is the core of the work. We try every possible combination of pre-processing, vectorization and classification to ensure the output has the best possible F1-score for the given sub-task. We start by cleaning the data using a certain combination of pre-processors, then extract features using one of the vectorizers and finally, to complete the pipeline, tune a classifier's hyper-parameters on the resulting data matrix, repeating for each combination.
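The search loop can be sketched as below. The corpus, the pre-processor/vectorizer/classifier names and the grid are all illustrative stand-ins; the structure (nest over every combination, score each pipeline, collect for ranking) is the point.

```python
# Sketch of the all-out search: score every (pre-processor, vectorizer,
# classifier) combination and collect the results for ranking (toy setup).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

tweets = ["you are awful", "have a nice day", "total idiot",
          "lovely weather", "shut up fool", "great game today"] * 5
labels = [1, 0, 1, 0, 1, 0] * 5  # 1 = offensive, 0 = not offensive

preprocessors = {"identity": lambda s: s, "lowercase": str.lower}
vectorizers = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
classifiers = {"nb": MultinomialNB,
               "lr": lambda: LogisticRegression(max_iter=1000)}

models = []
for p_name, pre in preprocessors.items():
    cleaned = [pre(t) for t in tweets]
    for v_name, Vec in vectorizers.items():
        X = Vec().fit_transform(cleaned)
        for c_name, Clf in classifiers.items():
            score = cross_val_score(Clf(), X, labels,
                                    cv=3, scoring="f1_macro").mean()
            models.append((score, p_name, v_name, c_name))

models.sort(reverse=True)  # rank every pipeline by its F1-score
print(models[:3])          # the top 3 models feed the ensemble
```

The real system iterates over the full sets of techniques in Table 1 and the grids in Table 2, but the control flow is the same.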

The resulting set 'models' contains, for each classifier, a list of parameters and their corresponding pre-processors, vectorizers and F1-scores. The results can be plotted to visualize the performance of each model, as seen in Figure 1, which shows the top 3 models for the logistic regression classifier: each group of 3 bars represents a hyper-parameter combination and the top 3 combinations of pre-processing and vectorization. The best F1-score (0.683) came from pre-processing with stopword removal followed by lemmatization, a count vectorizer and hyper-parameters [penalty: l2, solver: sag]. Following this search, now that we have the scores of each model, we can form an ensemble of the top 3 models, listed in Table 3, to give us a better overall view of the data. As an extra layer, we can re-tune the classifier parameters in case any error appeared in the previous step.
(Table 3: Top 3 models for each sub-task; these 3 models form an ensemble to enhance performance.)
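The ensemble phase can be sketched with scikit-learn's VotingClassifier. The member models and data here are toy placeholders (the actual members are the top 3 models in Table 3); the sketch shows the majority-vote mechanism.

```python
# Sketch of an ensemble of three learners via hard (majority) voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Toy data standing in for the vectorized tweets.
X, y = make_classification(n_samples=150, n_features=10, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",  # each member votes; the majority label wins
)
ensemble.fit(X, y)
preds = ensemble.predict(X[:5])
print(preds.shape)  # → (5,)
```

With hard voting, a tweet is labeled offensive only when at least two of the three members agree, which smooths over the individual models' errors.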

Results
We submitted a couple of models: for Sub-task A, an ensemble of the three top models mentioned in Table 3, the Random Forest model and a 1D CNN. The ensemble did best in Sub-task A, with the Random Forest (RF) model a very close second. Results can be viewed in Table 4 and the confusion matrix in Figure 2. For Sub-task B, the ensemble submission unfortunately failed, though its prospects did not look good anyway. The best model was a simple Naive Bayes (NB)-TFIDF model, which got a very good F1-score of 0.887. Results can be viewed in Table 5 and the confusion matrix in Figure 3. Finally, for Sub-task C, we chose to drop the CNN model as it did not achieve an acceptable result and went with the ensemble, which got the best accuracy but came second in F1-score, and the RF model, which also got good accuracy but a poor F1-score; the best model was a logistic regression-count model with an F1-score of 0.5093. Results can be viewed in Table 6 and the confusion matrix in Figure 4. (Table 6: Results for Sub-task C. The ensemble got the best accuracy, but LR got a better F1-score.)
Looking at these results, we hypothesize that the system's performance can be improved by combining all word embedding features instead of using them individually. It was also remarkable that, for most sub-tasks, a simple Naive Bayes-TFIDF model came close to being the best of all. We also believe better results could be achieved if the dataset were more balanced, with more offensive tweets, and if we had had sufficient time to perform grammar checking on the tokens and other noise-reducing operations. The problem of out-of-vocabulary (OOV) words, which we unfortunately did not attempt to solve, could later be addressed by using a character-level embedding model rather than a word-level one.

Conclusion
This paper describes our offensive tweet identification and categorization system, built in the framework of SemEval-2019 Task 6. We used a brute-force search technique to find the best model that could be generated from a list of pre-processing techniques, feature extraction models and classifiers, achieving an F1-score of 0.728 in Sub-task A, 0.6161 in Sub-task B and 0.5093 in Sub-task C. In future work, we aim to focus more on word embedding features by concatenating all 3 word vector models and experimenting with character-level/sentence-level models.