SSN_NLP at SemEval-2019 Task 6: Offensive Language Identification in Social Media using Traditional and Deep Machine Learning Approaches

Offensive language identification (OLI) in user-generated text is the automatic detection of any profanity, insult, obscenity, racism or vulgarity that degrades an individual or a group. It is useful for hate speech detection, flame detection and cyber bullying detection. With the immense growth of social media accessibility, OLI helps to curb abusive and hurtful content. In this paper, we present deep and traditional machine learning approaches for OLI. In the deep learning approach, we have used bi-directional LSTMs with different attention mechanisms to build the models, and in the traditional machine learning approach, TF-IDF weighting schemes with classifiers, namely Multinomial Naive Bayes and Support Vector Machines with a Stochastic Gradient Descent optimizer, are used for model building. The approaches are evaluated on the OffensEval@SemEval2019 dataset, and our team SSN_NLP submitted runs for the three tasks of the OffensEval shared task. The best runs of SSN_NLP obtained F1 scores of 0.53, 0.48 and 0.3, and accuracies of 0.63, 0.84 and 0.42, for Tasks A, B and C respectively. Our approaches improved the baseline F1 scores by 12%, 26% and 14% for Tasks A, B and C respectively.


Introduction
Offensive language identification (OLI) is the process of detecting offensive language classes (Razavi et al., 2010) such as slurs, homophobia, profanity, extremism, insult, disguise, obscenity, racism or vulgarity that hurt or degrade an individual or a group in user-generated text such as social media postings. OLI is useful for several applications such as hate speech detection, flame detection, aggression detection and cyber bullying detection. Recently, several research works have been reported on identifying offensive language in social media content. Several workshops such as TA-COS 1, TRAC 2 (Kumar et al., 2018a), Abusive Language Online 3 and GermEval (Wiegand et al., 2018) have been organized recently in this research area. In this line, the OffensEval@SemEval2019 (Zampieri et al., 2019b) shared task focuses on the identification and categorization of offensive language in social media. It comprises three subtasks, namely offensive language detection, categorization of offensive language and offensive language target identification. Sub Task A aims to detect whether a text is offensive (OFF) or not offensive (NOT). Sub Task B aims to categorize the offensive type as targeted (TIN) or untargeted (UNT). Sub Task C focuses on identification of the target as an individual (IND), a group (GRP) or others (OTH). Our team SSN_NLP participated in all three subtasks.

Related Work
Several research works have been reported since 2010 in this research field of hate speech detection (Kwok and Wang, 2013; Burnap and Williams, 2015; Djuric et al., 2015; Davidson et al., 2017; Malmasi and Zampieri, 2018; Schmidt and Wiegand, 2017; Fortuna and Nunes, 2018; ElSherief et al., 2018; Gambäck and Sikdar, 2017; Zhang et al., 2018; Mathur et al., 2018). Schmidt and Wiegand (2017) and Fortuna and Nunes (2018) reviewed the approaches used for hate speech detection. Kwok and Wang (2013) used bag-of-words and bi-gram features with a machine learning approach to classify tweets as "racist" or "nonracist". Burnap and Williams (2015) developed a supervised algorithm for hateful and antagonistic content in Twitter using a voted ensemble meta-classifier. Djuric et al. (2015) learnt distributed low-dimensional representations of social media comments using neural language models for hate speech detection. Davidson et al. (2017) used n-gram (unigram, bigram, and trigram) features with TF-IDF scores along with a crowd-sourced hate speech lexicon, and employed several classifiers including logistic regression with L1 regularization to separate hate speech from other offensive language. Malmasi and Zampieri (2018) used n-grams, skip-grams and clustering-based word representations as features with an ensemble classifier for hate speech detection. ElSherief et al. (2018) performed linguistic and psycholinguistic analysis to detect whether hate speech is "directed" towards a target or "generalized" towards a group. Gambäck and Sikdar (2017) used deep learning with CNN models to classify hate speech as "racism", "sexism", "both" or "nonhate-speech". They used character 4-grams, word vectors based on word2vec, randomly generated word vectors, and word vectors combined with character n-grams as features in their approach. Kumar et al. (2018b) presented the findings of the shared task on aggression identification, which aims to detect different scales of aggression, namely "Overtly Aggressive", "Covertly Aggressive", and "Non-aggressive". Madisetty and Desarkar (2018) used CNN, LSTM and Bi-LSTM models to detect the above scales of aggression. Waseem et al. (2017) and Park and Fung (2017) presented methodologies for abusive language identification using deep neural networks.
Research on identifying offensive language has also focused on non-English languages such as German (Wiegand et al., 2018), Hindi (Kumar et al., 2018b), Hinglish: Hindi-English (Mathur et al., 2018), Slovene (Fišer et al., 2017) and Chinese (Su et al., 2017). Wiegand et al. (2018) presented an overview of the GermEval shared task on the identification of offensive language, which focused on classification of German tweets from Twitter. Kumar et al. (2018b) focused on the shared task to identify aggression in Hindi text. Mathur et al. (2018) applied transfer learning to detect three classes, namely "nonoffensive", "abusive" and "hate-speech", in Hindi-English code-switched language. Fišer et al. (2017) presented a framework to annotate offensive labels in Slovene. Su et al. (2017) rephrased profanity in Chinese text after detecting it in social media text.

Data and Methodology
In our approach, we have used the OLID dataset (Zampieri et al., 2019a) provided by the OffensEval@SemEval2019 shared task. The dataset is given in .tsv file format with the columns ID, INSTANCE, SUBA, SUBB and SUBC, where ID represents the identification number of the tweet, INSTANCE represents the tweet text, SUBA consists of the labels Offensive (OFF) and Not Offensive (NOT), SUBB consists of the labels Targeted Insult and Threats (TIN) and Untargeted (UNT), and SUBC consists of the labels Individual (IND), Group (GRP) and Other (OTH). The dataset has 13240 tweets. All the instances are considered for Sub Task A. However, for Sub Task B and Sub Task C we have filtered and considered only the data labelled with "TIN/UNT" and "IND/GRP/OTH" respectively, ignoring the instances labelled with "NULL". Thus, we have obtained 4400 and 3876 instances for Sub Task B and Sub Task C respectively. We have preprocessed the data by removing the URLs and the text "@USER" from the tweets. Tweet tokenizer 4 is used to obtain the vocabulary and features for the training data.
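The cleaning step described above can be sketched as follows. This is a minimal stdlib illustration, not our exact pipeline: the regular expressions and the simple stand-in tokenizer (we used a dedicated tweet tokenizer in our runs) are assumptions for the sake of the example.

```python
import re

def preprocess(tweet: str) -> str:
    """Strip URLs and the anonymised @USER handle, as in our cleaning step."""
    tweet = re.sub(r"https?://\S+|www\.\S+", "", tweet)  # drop URLs
    tweet = tweet.replace("@USER", "")                   # drop user placeholder
    return " ".join(tweet.split())                       # collapse whitespace

def tokenize(text: str) -> list:
    """Simple stand-in tokenizer (words and punctuation marks)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())
```

For example, `preprocess("@USER you are awful http://t.co/abc")` yields `"you are awful"`, which is then tokenized before vocabulary construction.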
We have employed both traditional machine learning and deep learning approaches to identify the offensive language in social media. The models that are implemented for the three sub-tasks are given in Table 1.
In the deep learning (DL) approach, the tweets are vectorized using word embeddings and fed into encoding and decoding processes. Bidirectional LSTMs are used for the encoding and decoding processes; we have used 2 layers of LSTM for this. The output is given to a softmax layer through an attention wrapper to obtain the OffensEval class labels. We have trained the deep learning models with a batch size of 128 and a dropout of 0.2 for 300 epochs to build the models. We have employed two attention mechanisms, namely Normed Bahdanau and Scaled Luong (Luong et al., 2015, 2017), in this approach. These two variations are implemented to predict the class labels for all three subtasks. The attention mechanisms help the model to capture the group of input words relevant to the target output label. For example, consider this instance from Task C: "we do not watch any nfl games this guy can shove it in his pie hole". This instance contains the offensive slang "pie hole" and refers to watching "nfl games". The attention mechanism captures these named entities or groups of words and correctly maps the instance to the label "GRP". Also, it is evident from earlier experiments (Sutskever et al., 2014; Thenmozhi et al., 2018) that bi-directional LSTMs with attention mechanisms perform better for mapping input sequences to output sequences.
In the traditional learning (TL) approach, the features are extracted from the tokens with a minimum count of two. The feature vectors are constructed using TF-IDF scores for the training instances. We have chosen the classifiers Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM) with a Stochastic Gradient Descent optimizer to build the models for Task B and Task C respectively. These classifiers were chosen based on their cross-validation accuracies. The class labels "TIN/UNT" and "IND/GRP/OTH" are predicted for Task B and Task C using the respective models.
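The attention weighting described above can be illustrated with a small sketch. This is a generic (scaled) Luong-style dot-product attention over encoder states in plain Python, not our actual implementation; the toy two-dimensional vectors are made up for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def luong_attention(decoder_state, encoder_states, scale=1.0):
    """Score each encoder state against the decoder state (dot product,
    optionally scaled), normalize the scores with softmax, and return the
    attention weights plus the weighted context vector."""
    scores = [scale * sum(d * e for d, e in zip(decoder_state, h))
              for h in encoder_states]
    weights = softmax(scores)
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(len(decoder_state))]
    return weights, context
```

Encoder states aligned with the decoder state (e.g. embeddings of "pie hole" when predicting the target label) receive higher weights and thus dominate the context vector fed to the softmax layer.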
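The TL feature construction can be sketched as follows. This is a simplified TF-IDF variant for illustration (the exact weighting scheme and tooling used in our runs may differ); it keeps the minimum-count-of-two filter described above.

```python
import math
from collections import Counter

def tfidf_vectors(docs, min_count=2):
    """Build TF-IDF feature vectors over tokenized documents, keeping only
    tokens that occur at least `min_count` times in the corpus."""
    counts = Counter(tok for doc in docs for tok in doc)
    vocab = sorted(t for t, c in counts.items() if c >= min_count)
    n = len(docs)
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    idf = {t: math.log(n / df[t]) + 1.0 for t in vocab}  # smoothed IDF variant
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors
```

The resulting vectors are what a classifier such as MNB or an SGD-trained SVM consumes; rare tokens (below the minimum count) never enter the vocabulary, which keeps the feature space compact.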

Results
We have evaluated our models using the test data of the OffensEval@SemEval2019 shared task for the three subtasks. The performance was analyzed using the metrics precision, recall, macro-averaged F1 and accuracy. The results of our approaches are presented in Tables 2, 3 and 4 for Task A, Task B and Task C respectively. The best results were obtained by the DL model with Scaled Luong attention (DL SL) for Task A, the DL model with Normed Bahdanau attention (DL NB) for Task B, and the TL model with SVM (TL SVM) for Task C. The Scaled Luong attention mechanism performs better when more data is available for training, whereas the Normed Bahdanau attention mechanism performs better even for a small dataset. However, deep learning gives poorer results than the traditional learning approach for Task C, because only 3876 instances were available for model building. The deep learning model could not learn the features appropriately due to the limited domain knowledge imparted by the smaller dataset. Thus, traditional learning performs better than deep learning at this data size for Task C. The confusion matrices for our best runs in the three subtasks are depicted in Tables 5, 6 and 7. These tables show that the true positive rates of the "NOT", "TIN" and "IND" classes are good because those classes have more samples in the training set. Our approaches show improvement over the baseline systems for all three tasks. We obtained 12% and 14% improvement in F1 and accuracy respectively for Task A when compared with the baseline. For Task B, we obtained 26% and 34% improvement in F1 and accuracy respectively. Also, the Task C results improved by 14% and 7% in F1 and accuracy when compared to the baseline results.
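The official metric, macro-averaged F1, is the unweighted mean of the per-class F1 scores, so each class counts equally regardless of its size. A minimal computation:

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: compute precision/recall/F1 per class from
    true-positive, false-positive and false-negative counts, then average
    the per-class F1 scores with equal weight."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Applied to the Task A confusion matrix below (73/147/167/473), this gives a macro-F1 of about 0.53 and an accuracy of (73 + 473)/860 ≈ 0.63, matching the reported scores.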

      OFF   NOT
OFF    73   147
NOT   167   473