Pardeep at SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media using Deep Learning

The rise of social media has made information exchange faster and easier. However, the use of offensive language on these platforms has recently surged. The main challenge for a service provider is to correctly identify such offensive posts and take the necessary action to monitor and control their spread. In this work, we address this problem using deep learning techniques, namely LSTM, Bidirectional LSTM and Bidirectional GRU. Our proposed approach solves the three Sub-tasks of SemEval-2019 Task 6, which cover the identification of offensive tweets as well as their categorization. We obtain significantly better leaderboard results for Sub-task B and decent results for Sub-task A and Sub-task C, validating that the proposed models can be used to automate offensive-post detection in social media.


Introduction
Social media has revolutionized the way people communicate. It is an instant communication medium that connects people all over the world and lets them share their views. However, some people misuse this freedom by posting offensive comments to defame, insult or target an individual or a group of individuals. The mainstream media have reported various cases of suicide and depression caused by trolling and cyberbullying on social media. It is therefore a serious concern for corporations, government organizations and security agencies to stop or mitigate this type of user behavior. Checking such negative behavior manually is impossible due to the volume, velocity and variety of data coming from social networks. There is therefore a pressing need for a system that automatically identifies and categorizes offensive language in social networks. SemEval-2019 (Zampieri et al., 2019b) aimed exactly at this need and organized a task on identifying and categorizing offensive language in social media. The task is divided into three Sub-tasks: Sub-task A, offensive language identification; Sub-task B, categorization of offense types; Sub-task C, offense target identification. All three Sub-tasks are related to each other. In Sub-task A, we have to identify whether a given tweet is offensive or not; it is a binary classification task based on the tweet text. In Sub-task B, the challenge is to categorize the tweets found offensive in Sub-task A as targeted or untargeted. Sub-task C is more challenging than the other two Sub-tasks due to its multi-class nature: its goal is to take the tweets found targeted in Sub-task B and categorize them as targeting an individual, a group or others.
Our approach for SemEval-2019 Task 6 (identifying and categorizing offensive language in social media) comprises three deep learning models: Bidirectional LSTM, Bidirectional GRU and standard LSTM. These are popular deep learning sequence models applied in many text classification tasks. We used the pre-trained word-level embedding GloVe (Global Vectors for Word Representation) to obtain vector representations for the words appearing in tweets and used these representations as features for training the models. To check the performance of the models, 10-fold cross-validation was applied to the given training data. We compared the results of the above models with various baselines such as Logistic Regression, Support Vector Machine, Gradient Boosting and XGBoost. The baseline models are reasonably good, but their classification accuracy is poor compared to the deep learning models. This paper presents the description of our approaches and results for SemEval-2019 Task 6.

Related Work
This section discusses existing work related to identifying and categorizing offensive language in social media. Researchers have applied various computational methods to deal with hate speech, aggression, offensive language, racist and sexist language, and cyberbullying. Hate Speech: One line of work applied CNN and GRU deep neural networks along with pre-trained Google Word2vec word embeddings to detect hate speech on Twitter. (Zhang and Luo, 2018) proposed the Skip-Gram Extraction CNN (SKIP-CNN) deep neural network model to identify hate speech in social media text. The paper argues that hate speech lacks distinctive and unique features in a dataset, which makes it hard to discover; the proposed model serves as a feature extractor that captures the semantics of hate speech in social media. Aggression: A method to detect aggression in social media is proposed in (Madisetty and Desarkar, 2018). The authors applied CNN, LSTM and Bidirectional LSTM to a Facebook comment dataset; the outputs of these three deep learning models are fed to a majority-based ensemble method for aggression detection. Another paper (Kumar et al., 2018) presents the system description report of the shared task on identification of aggression in social media, part of the 1st Workshop on Trolling, Aggression and Cyberbullying (TRAC-1). An aggression-annotated dataset of Facebook posts and comments in English and Hindi was provided to the participants for training and validation. Six of the top ten best-performing models were trained using LSTM, Bidirectional LSTM, CNN and RNN deep neural networks. Racist and Sexist Language: (Davidson et al., 2017) focused on classifying homophobic and racist tweets as hate speech and tweets with sexist remarks as offensive. They use Logistic Regression with L2 regularization to predict class membership.
(Pitsilis et al., 2018) proposed an ensemble LSTM deep learning classifier that utilizes user behavior metrics to capture each user's inclination towards racism and sexism from their tweeting history. Cyberbullying: (Dadvar et al., 2013) studied cyberbullying detection. They combine individual comments, user characteristics and user profile information to train a Support Vector Machine classifier, and report that adding user history to text features improves cyberbullying detection accuracy. (Rafiq et al., 2018) proposed a multi-stage cyberbullying detection mechanism with two novel components: a dynamic priority scheduler, which drastically reduces classification time, and an incremental classification method, which is highly responsive in raising alerts. There have been many further publications and studies on offensive language, aggression and hate speech in social media; examples include (Wiegand et al., 2018), (ElSherief et al., 2018) and (Fortuna and Nunes, 2018). All these methods have pros and cons associated with them. This paper therefore proposes the use of deep learning sequence models for better accuracy on SemEval-2019 Task 6.

Methodology and Data
In this section, we first describe the dataset used in the competition and then explain the approaches used for solving the problem.

Dataset Used
The dataset provided by the task organizers is the Offensive Language Identification Dataset (OLID). The details of the data and its annotation are available in (Zampieri et al., 2019a). For Sub-task A, the dataset contains tweets labeled into two categories: offensive (OFF) and not offensive (NOT). For Sub-task B, tweets are labeled into two categories: targeted insult (TIN) and untargeted (UNT). For Sub-task C, the given tweets are classified into three categories: group (GP), individual (IND) and others (OTH). Out of the 13,240 training samples of Sub-task A, 4,404 samples belong to Sub-task B and 3,877 samples belong to Sub-task C. All the tweets are in English. The statistics of the dataset and some instances of tweets with their labels are shown in Table 1 and Table 2.

             Training Set samples   Testing Set samples
Sub-task A   13,240                 860
Sub-task B   4,404                  240
Sub-task C   3,877                  213

Methodology
Here we discuss our proposed approach in detail. Our initial approach was to try standard machine learning algorithms: Logistic Regression (Hosmer Jr et al., 2013), Random Forest (Xu et al., 2012), Support Vector Machines (Chang and Lin, 2011), XGBoost (Chen and Guestrin, 2016) and Gradient Boosting (Natekin and Knoll, 2013). We use TF-IDF vectorization to vectorize the text and then apply the above algorithms for model development. The performance of these algorithms was not acceptable, as they gave low accuracy. To overcome this, we use deep learning algorithms for classifying the text. First we convert the text into vector representations with the help of GloVe (Pennington et al., 2014) word-level embeddings and then use these representations as input to the deep learning models described in the subsequent sections. The multi-layered architecture of our approach is presented in Figure 1. It comprises various components in the form of layers. Since the data is text, the first step is to vectorize it: we tokenize the text into W1, W2, W3, ..., Wn and apply pre-trained GloVe word embeddings to obtain vector representations R1, R2, R3, ..., Rn. The next layer can be an LSTM, Bidirectional LSTM or Bidirectional GRU block, described in the next subsections. To mitigate overfitting, we add a small amount of dropout. Finally, we use a Dense layer and a Softmax/Sigmoid layer to obtain the model output.
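As an illustration of the vectorization step, the sketch below maps tokenized tweets to padded index sequences and builds the embedding matrix that a model's embedding layer would consume. The 4-dimensional toy vectors stand in for the real 50- to 300-dimensional pre-trained GloVe vectors, and the vocabulary and tweets are hypothetical:

```python
import numpy as np

# Toy stand-in for pre-trained GloVe vectors (real GloVe files map
# each word to a 50/100/200/300-dimensional vector).
glove = {
    "you":   np.array([0.1, 0.2, 0.3, 0.4]),
    "are":   np.array([0.5, 0.1, 0.0, 0.2]),
    "awful": np.array([0.9, 0.8, 0.7, 0.1]),
}

tweets = [["you", "are", "awful"], ["you", "are"]]

# Build a word index; index 0 is reserved for padding.
vocab = {w: i + 1 for i, w in enumerate(sorted({t for tw in tweets for t in tw}))}

# Embedding matrix: row i holds the GloVe vector of the word with index i.
dim = 4
emb_matrix = np.zeros((len(vocab) + 1, dim))
for word, idx in vocab.items():
    emb_matrix[idx] = glove.get(word, np.zeros(dim))  # zeros for OOV words

# Convert tweets to fixed-length index sequences (post-padding to length 5).
max_len = 5
def encode(tokens):
    ids = [vocab[t] for t in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))

X = np.array([encode(tw) for tw in tweets])
print(X.shape)               # (2, 5)
print(emb_matrix[X].shape)   # (2, 5, 4): the sequence of vectors R1, ..., Rn
```

The resulting index matrix is what gets fed to the network, with the embedding matrix loaded as (frozen or trainable) weights of the embedding layer.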

Long Short Term Memory (LSTM)
We first try the LSTM (Hochreiter and Schmidhuber, 1997), which has been used successfully in many text classification tasks (Madisetty and Desarkar, 2018). The LSTM is a special kind of RNN which captures long contexts and long-range dependencies in sentences very efficiently and addresses the vanishing gradient problem of the RNN (Lipton et al., 2015) with the help of carefully regulated structures named gates. The main components of the LSTM model are the input gate, forget gate, output gate and candidate memory state. All these gates are single-layered neural networks with the Sigmoid activation function, except the candidate memory state, which uses tanh as the activation function.
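For reference, the gates described above can be written in the standard formulation (common textbook notation; $\odot$ denotes element-wise multiplication, $x_t$ the input at step $t$ and $h_{t-1}$ the previous hidden state):

```latex
\begin{align}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate memory)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\
h_t &= o_t \odot \tanh(c_t)
\end{align}
```

The additive update of the cell state $c_t$ is what lets gradients flow over long ranges without vanishing.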

Gated Recurrent Unit (GRU)
The Gated Recurrent Unit (Tjandra et al., 2016) is a simpler variant of the LSTM. It also addresses the vanishing gradient problem of the RNN and tries to capture long-range connections, but with fewer gates than the LSTM. This leads to fewer model parameters, which enables faster and more efficient model development in comparison to an LSTM-based model.
The main components of the GRU are the reset gate, update gate and current memory content. As in the LSTM, both the reset gate and the update gate are single-layered neural networks with the Sigmoid activation function, while the current memory content uses tanh as the activation function. The basic function of the reset gate is to determine how much of the past information to discard, whereas the update gate decides how much information the model should pass on to the next states.
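In the commonly used formulation (conventions for the role of $z_t$ vary slightly across texts), these components are:

```latex
\begin{align}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)}\\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)}\\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) && \text{(current memory content)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{align}
```

With only three weight blocks instead of the LSTM's four, and no separate cell state, the GRU has fewer parameters for the same hidden size.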

Bidirectional LSTM and GRU
Both the LSTM and the GRU use the sequential information of textual data and capture long-range dependencies. The catch is that they process the sequence in only one direction, whereas the Bidirectional version additionally considers a reversed copy of the input. In certain problems, this reversal leads to better feature understanding and improved model performance. In our work, we mainly use the standard LSTM and the Bidirectional versions of both the LSTM and the GRU for model development. The detailed experimental setup is described in the next section.
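The idea can be sketched without any deep learning library: a toy recurrence (a decayed running sum standing in for an LSTM/GRU cell; the cell and inputs here are purely illustrative) is run over the sequence forward and over its reversed copy, and the two final states are concatenated, which is essentially what a Bidirectional wrapper does with its underlying recurrent layer:

```python
def toy_cell(state, x, decay=0.5):
    """Stand-in for an LSTM/GRU cell: a decayed running sum."""
    return decay * state + x

def run(sequence):
    """Run the toy recurrence left-to-right and return the final state."""
    state = 0.0
    for x in sequence:
        state = toy_cell(state, x)
    return state

def bidirectional(sequence):
    """Concatenate the forward pass with a pass over the reversed input."""
    return (run(sequence), run(list(reversed(sequence))))

seq = [1.0, 2.0, 4.0]
fwd, bwd = bidirectional(seq)
# The forward state is dominated by late tokens, the backward state by early ones,
# so the concatenation summarizes the sequence from both ends.
print(fwd, bwd)
```

In Keras this corresponds to wrapping the recurrent layer, e.g. `Bidirectional(LSTM(units))`, which doubles the output dimensionality.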

Experimental Setting
For implementing the models, we use the Keras (Ketkar, 2017) and Scikit-learn (Pedregosa et al., 2011) Python libraries. The experimental details and model configuration are shown in Table 3. For regularization, we add a small proportion of dropout. For the GRU model, we specify the number of recurrent units. For training, we use categorical cross-entropy as the loss function with ADAM as the optimizer. All the models are tested using 10-fold cross-validation.
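A minimal sketch of the 10-fold evaluation protocol, using random toy features and a scikit-learn Logistic Regression as a stand-in for the neural models (which would be scored the same way, fold by fold):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy data standing in for the real feature matrix (40 samples, 2 classes).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = np.array([0, 1] * 20)

# Stratified 10-fold cross-validation: each fold trains on 9/10 of the
# data and is scored on the held-out 1/10.
scores = []
for train_idx, test_idx in StratifiedKFold(
        n_splits=10, shuffle=True, random_state=0).split(X, y):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    scores.append(f1_score(y[test_idx], pred, average="macro"))

print(len(scores))  # 10 fold scores, averaged to report one number per model
```

Stratification keeps the OFF/NOT class ratio roughly constant across folds, which matters for the imbalanced task data.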

Model Configuration          Value
sentence length              32
batch size                   64
recurrent units (for GRU)    64
dense size                   16
dropout rate                 0.5
number of epochs             300

Impact of Batch Size on Model Performance
We evaluated our proposed approach with three different batch sizes, 64, 128 and 256, to check their impact on model performance. We found experimentally that a batch size of 64 provides the best results.
Performance metrics: The official evaluation metric for all three Sub-tasks is the macro-averaged F1 score. For additional analysis, we use Accuracy, Precision, Recall and ROC-AUC.
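All of these metrics are available in scikit-learn; the labels below are a hypothetical toy example (1 = OFF, 0 = NOT), not actual task predictions:

```python
from sklearn.metrics import (f1_score, accuracy_score, precision_score,
                             recall_score, roc_auc_score)

# Toy gold labels, hard predictions and predicted probabilities.
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6]

macro_f1 = f1_score(y_true, y_pred, average="macro")  # official task metric
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)   # for the positive (OFF) class
rec = recall_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)      # needs scores, not hard labels

print(round(macro_f1, 3), round(acc, 3))  # 0.667 0.667
```

Macro averaging weights each class equally regardless of its frequency, which is why it is preferred over Accuracy for the imbalanced label distributions of this task.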

Results and Discussions
This section contains the detailed experimental results of the proposed models and the baselines. Multiple baseline approaches are helpful for comparing model performance on validation sets. To this end, we apply the various computational models to the training data released for Sub-task A in order to determine which models give better results. Table 4 presents each model's results in terms of Accuracy, F1 (Macro), Precision, Recall and ROC-AUC score. It is evident from this Table that the deep learning models (LSTM, Bidirectional LSTM and Bidirectional GRU) with GloVe word embeddings outperformed the TF-IDF based machine learning algorithms. The LSTM model provides the best Accuracy among all the models. The Bidirectional LSTM provides the best macro F1, and the Random Forest with TF-IDF gives the best Precision. The Bidirectional GRU provides the best Recall. The standard LSTM and Bidirectional LSTM perform equally well in terms of ROC-AUC.
Table 4: Results of the proposed deep learning approaches and baselines on the Sub-task A training data using 10-fold cross-validation.

Results for Sub-task A
The official results of our proposed models on the test set for Sub-task A are shown in Table 5. As is evident from these results, the Bidirectional GRU performed better than the other two deep learning models, with an F1 score of 0.69. To analyze the predicted labels, we also show the confusion matrix, in which correct class predictions lie along the diagonal. Our team ranked 74th out of 104 participating teams.

Class Label results
Besides the combined results of our proposed models on the test set for the three Sub-tasks, we also present per-class results in Tables 8, 9 and 10. The results in these tables show how well our models performed on each class label. Table 8 shows the per-class results for Sub-task A, which comprises two classes: offensive (OFF) and not offensive (NOT). The Bidirectional GRU performed best on both classes, with F1 scores of 0.54 and 0.84 respectively, validating that the offensive class is relatively difficult to classify. Table 9 shows the per-class results for Sub-task B, which comprises two classes: targeted insult (TIN) and untargeted (UNT). The Bidirectional GRU performed best on both classes, with F1 scores of 0.94 and 0.45 respectively, which shows that the untargeted class is much harder to classify.

Conclusion
In this paper, we address the challenge of identifying offensive tweets as well as categorizing them. Our proposed approach comprises three deep learning based techniques for efficient classification of offensive posts in social media. We show that applying word embeddings to social media text, followed by sequence models such as LSTM, Bidirectional LSTM and Bidirectional GRU, leads to better classification of the text. This approach can also be incorporated into an end-to-end framework. Overall, our approach provides an efficient way of classifying text in social media. For future work, we want to include character-based embeddings along with the pre-trained word-level embeddings for a better representation of the text. The addition of an attention layer to the deep networks may also increase performance further.
Table 10: Per-class performance of our proposed models for Sub-task C.