TECHSSN at SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Tweets using Deep Neural Networks

Task 6 of SemEval 2019 involves identifying and categorizing offensive language in social media. The systems developed by the TECHSSN team use multi-level classification techniques. We have developed two systems. In the first system, the first level of classification is done by a multi-branch 2D CNN classifier with Google’s pre-trained Word2Vec embedding, and the second level by a string matching technique supported by a dictionary of offensive and bad words. The second system uses a multi-branch 1D CNN classifier with a pre-trained GloVe embedding layer for the first level of classification and string matching for the second level. Input data with a probability of less than 0.70 in the first level are passed on to the second level, which is intended to correct the examples misclassified in the first level.


Introduction
The growth of social media networks in recent days has been phenomenal and Twitter is no exception. However, this rapid growth also poses a serious challenge of maintaining ethics in social media because of the degradation of moral values in society. Offensive tweets are generated on a daily basis targeting a particular person, organization, race, caste, community, religion, gender and so forth. For this reason, our task for SemEval-2019 (Zampieri et al., 2019b) mainly focuses on detecting offensive language in tweets and classifying it.
In Subtask-A, we tried to classify the tweets into two classes, namely offensive and non-offensive. The offensive tweets in Subtask-A are then categorized in Subtask-B into targeted and untargeted tweets, where targeted tweets are aimed at a specific person, organization, religion or political party. Further, Subtask-C deals with the fine-grained classification of offensive tweets into three classes, namely person, organization and others.
The training dataset provided by the organizers contains 13240 tweets. The given dataset is used as the preliminary dataset to train our model. In addition, the Impermium dataset from Kaggle and the TRAC dataset are added to improve the accuracy of the model. We have also used a dictionary of offensive words in the second level of classification. Manually classifying the tweets is ambiguous and highly subjective, and is one of the biggest challenges. The mix of colloquial slang in tweets, veiled references, missing data, and the use of symbols and emojis are further hurdles that lower the prediction accuracy (Founta et al., 2018).

Related Work
Many researchers in the field of Artificial Intelligence and Natural Language Processing have been working to detect offensive speech in tweets using sentiment analysis. Pang et al. (2002) used a three-level classification system with a Naive Bayes classifier in the first level, Multinomial Updatable Naive Bayes in the second level and a rule-based classifier named DTNB in the third level. The second level of classification increased the accuracy by 7%, while the third level improved it by a further 6%. The results were boosted by the use of a dictionary of insulting and abusive language. Stammbach et al. (2018) developed an ensemble model of a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN) and CNN+RNN that gave macro-averaged F1-scores of 77.7%, 78.6% and 77.6% respectively on 10-fold cross validation. It is stated that the correct words generated by the spell-checker did not occur in the embeddings and that this might be one of the reasons for the low performance of the model. Che et al. (2017) presented a review of the recent deep neural networks used for text classification and a comparison of word embeddings to Support Vector Machine (SVM) classifiers. They point out that a critical issue with CNNs is the restriction to a fixed input size and hence the inability to handle sentences of variable length as input, and therefore focus on RNNs. It is also mentioned that n-grams with syntactic and semantic information achieve significantly better results than the standard n-grams. Le and Mikolov (2014) use paragraph vectors instead of the Bag-of-Words (BOW) feature representation and reduce the classification error by approximately 39%. Dinakar et al. (2011) and ElSherief et al. (2018) discuss the problem of overlapping classes in target identification.  discuss the various challenges in discriminating hate, offensive and non-offensive text using ensemble techniques and stacked generalization meta learning methods.
Razavi et al. (2010) discuss multilevel classification for flame detection, using complement naive Bayes on the first level to select discriminative features and a multinomial updatable naive Bayes classifier on the second level to enhance the model with new features for adaptive learning. They have used a rule-based classifier named DTNB (Decision Table/Naive Bayes hybrid classifier) as the last level to classify the text into Flame/Not.

Methodology and Data
The task of classifying offensive tweets is difficult as it needs to discover the intention of the user. Moreover, people who follow the chatting convention use offensive words to express their feelings. The architecture diagram for the offensive text classification is shown in Figure 1.

Acquiring Datasets
The classifier can make a well-informed decision if we procure and supply more data to it. For good performance, deep learning requires a sufficiently large amount of data. Therefore, in addition to the dataset given by Zampieri et al. (2019a), we also compiled a variety of datasets for our tweet classification. We have added the TRAC training dataset (Kumar et al., 2018), consisting of online posts from Facebook, and the Impermium dataset from Kaggle (Impermium, 2013), and used a comprehensive list of known offensive words banned by Google in all their different forms. The TRAC dataset is labelled for multiclass classification (OAG, CAG, NAG). We have treated the OAG and CAG class labels as the OFF label and NAG as NOT for subtask A.
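The relabelling of the TRAC classes for subtask A can be illustrated with a minimal sketch. The function name, the in-memory pair representation and the sample rows are assumptions made for illustration only; they are not taken from the TRAC release or from our pipeline.

```python
# Mapping from TRAC aggression labels to the binary labels of subtask A,
# as described in the text: OAG and CAG become OFF, NAG becomes NOT.
TRAC_TO_SUBTASK_A = {"OAG": "OFF", "CAG": "OFF", "NAG": "NOT"}


def relabel_trac(rows):
    """Convert (text, trac_label) pairs into (text, subtask_a_label) pairs.

    `rows` is a hypothetical in-memory representation; the real TRAC files
    would first have to be parsed into such pairs.
    """
    return [(text, TRAC_TO_SUBTASK_A[label]) for text, label in rows]


# Made-up example rows, not taken from the TRAC dataset.
sample = [
    ("example post one", "OAG"),
    ("example post two", "NAG"),
    ("example post three", "CAG"),
]
relabelled = relabel_trac(sample)
```

An unknown label raises a KeyError in this sketch, which makes unexpected classes in the merged data visible early.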

Data Preprocessing
Data preprocessing is critical for the success of any machine learning solution. The given dataset shows many signs of irregularity, which are typical of any collection of tweets. Normalizing the data involves flattening the dimensions of data into textual form. The dataset is cleaned and processed using functions from the NLTK and spaCy toolkits.
During preprocessing, we (a) remove URLs, (b) annotate emojis and emoticons, (c) convert uppercase to lowercase, (d) expand contractions, (e) remove stopwords, (f) remove special characters, (g) remove accented characters, (h) reduce lengthened words, (i) lemmatize text, and (j) remove extra whitespace. Among the steps listed above, step (e) is omitted for subtask-B and subtask-C, since stopwords are significant for target identification.
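A subset of the cleaning steps above can be sketched with the Python standard library alone. This is an illustrative approximation, not the NLTK/spaCy pipeline used in our systems: the contraction map and stopword list are tiny placeholders, the order of the steps is an assumption, and emoji annotation (b) and lemmatization (i) are left out.

```python
import re
import unicodedata

# Tiny illustrative resources (assumptions); real contraction maps and
# stopword lists, e.g. the one shipped with NLTK, are much larger.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am", "don't": "do not"}
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "at"}


def preprocess(tweet, remove_stopwords=True):
    """Apply a subset of the cleaning steps listed above to one tweet."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", tweet)   # (a) remove URLs
    text = text.lower()                                    # (c) lowercase
    text = text.replace("\u2019", "'")                     # normalise curly apostrophes
    for short, full in CONTRACTIONS.items():               # (d) expand contractions
        text = text.replace(short, full)
    # (g) strip accents: decompose characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9\s]", " ", text)               # (f) remove special characters
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)             # (h) "sooooo" -> "soo"
    tokens = text.split()                                  # (j) collapses extra whitespace
    if remove_stopwords:                                   # (e) skipped for subtasks B and C
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)


raw = "Sooooo COOL!!! I can't wait, see https://t.co/abc at the caf\u00e9"
cleaned_a = preprocess(raw)                            # subtask A setting
cleaned_bc = preprocess(raw, remove_stopwords=False)   # subtasks B and C setting
```

The `remove_stopwords` flag mirrors the choice of dropping step (e) for subtasks B and C.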

Model Description
We classified the text using the following three models and compared the results:
1. 2D-CNN with Word2Vec Learned Embeddings
2. 1D-CNN with GloVe
3. SVM with BOW

2D-CNN with Word2Vec Learned Embeddings
Even though it is unconventional to use a two-dimensional convolutional network on text sequences, it has proved quite useful (Prusa and Khoshgoftaar, 2017). The model is initialized with Google's pre-trained Word2Vec weights, which are trained further together with the following additional layers. We have used an embedding layer which learns the weights of the embedding matrix during training. The bigrams, trigrams and fourgrams of the words are obtained by applying filters with kernels of the corresponding sizes to the embedding layer output. After applying the filters, a max pooling operation is performed on each CNN layer to scale down the output vectors into dense feature vectors, each of size 100. The three dense vectors are concatenated and flattened into a single dense vector of size 300 in the fully connected layer. The final output is obtained through a dense layer with 2 output units. The main idea behind the concatenation of the word grams is to compute n-grams in parallel, combine them and extract as much information as possible from the vectors. This enables the classifier to learn and understand the relationships between the underlying words. The parameters for the model are set as follows: the sequence length is 43, the learning rate is 0.001 and dropout is 0.5; the softmax activation function is used in the output layer and the ReLU activation function in the other layers.
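The data flow through the three parallel branches can be made concrete with a small NumPy forward pass. This is only a sketch of the shapes involved, under stated assumptions: the embedding dimension of 300 is assumed for the Word2Vec vectors, all weights and the input are random stand-ins rather than trained values, and biases, dropout and training are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

SEQ_LEN = 43                 # sequence length stated in the text
EMB_DIM = 300                # assumption: dimensionality of the Word2Vec vectors
N_FILTERS = 100              # each branch yields a feature vector of size 100
KERNEL_HEIGHTS = (2, 3, 4)   # bigram, trigram and fourgram branches
N_CLASSES = 2


def conv_branch(x, kernels):
    """ReLU convolution over word windows followed by max-over-time pooling.

    x:       (SEQ_LEN, EMB_DIM) embedded tweet
    kernels: (N_FILTERS, h, EMB_DIM); each filter spans h full word vectors
    """
    h = kernels.shape[1]
    # All windows of h consecutive words: (SEQ_LEN - h + 1, h, EMB_DIM)
    windows = np.stack([x[i:i + h] for i in range(x.shape[0] - h + 1)])
    # Dot product of every window with every filter: (n_windows, N_FILTERS)
    feature_maps = np.tensordot(windows, kernels, axes=([1, 2], [1, 2]))
    feature_maps = np.maximum(feature_maps, 0.0)   # ReLU
    return feature_maps.max(axis=0)                # (N_FILTERS,)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


# Random stand-ins for one embedded tweet and for the layer weights.
x = rng.normal(size=(SEQ_LEN, EMB_DIM))
branches = [rng.normal(scale=0.01, size=(N_FILTERS, h, EMB_DIM)) for h in KERNEL_HEIGHTS]
w_out = rng.normal(scale=0.01, size=(len(KERNEL_HEIGHTS) * N_FILTERS, N_CLASSES))

features = np.concatenate([conv_branch(x, k) for k in branches])   # (300,)
probs = softmax(features @ w_out)                                  # (2,)
```

Because every kernel covers the full embedding width, each filter slides only along the word axis, which is why the pooled output of a branch has one value per filter.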

1D-CNN with GloVe
GloVe embeddings with 1 million word vectors of 200 dimensions trained on Twitter are used as an alternative method to create the embedding matrix of the embedding layer in the network. We used a conventional one-dimensional convolutional neural network and extracted the skip-grams in parallel by using filters of size 2, 3 and 4, each vector of size 100, in three different branches and concatenated them to form one flattened dense vector of size 300. We have used 100 filters and a dropout value of 0.2. The softmax activation function is used in the output layer and the ReLU function in the other layers. To increase the representational power of the neural network, a couple of dense layers are added before the final output layer. This model has fewer trainable parameters and takes less time to train, yet performs on par with the 2D-CNN.

SVM with Bag-Of-Words
We implemented a Support Vector Machine using the Term Frequency-Inverse Document Frequency (TF-IDF) model and found it to be as good as the neural networks. Each input document is converted into a feature vector using the TF-IDF method. These feature vectors are fed as input to the SVM, which uses a linear function as the kernel. The confusion matrix for the SVM is shown in Table 1. A linear SVM is found to be better for text classification since most text classification problems are linearly separable (Joachims, 1998).
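To make the TF-IDF representation concrete, the following standard-library sketch computes one weighting variant on a toy corpus. The exact TF-IDF variant used in our system is not specified here, so the formula below (term frequency normalised by document length, idf = log(N / df)) is an assumption, as are the function name and the example documents. The SVM itself is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return the vocabulary and one TF-IDF vector per tokenised document.

    Weighting used here (an assumption): tf = count / document length,
    idf = log(N / document frequency). Documents must be non-empty.
    """
    n_docs = len(documents)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vocab = sorted(doc_freq)
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([
            (counts[term] / len(doc)) * math.log(n_docs / doc_freq[term])
            for term in vocab
        ])
    return vocab, vectors


# Made-up tokenised documents for illustration.
docs = [["you", "are", "kind"], ["you", "are", "rude"], ["rude", "rude", "tweet"]]
vocab, vectors = tfidf_vectors(docs)
```

Terms that occur in fewer documents receive a larger weight, so in the first toy document "kind" outweighs "you" and "are"; the resulting real-valued vectors are what a linear classifier would consume.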

Logistic Regression and RNN Models
We also trained Logistic Regression (Davidson et al., 2017) and RNN + LSTM (Long Short-Term Memory) models (Pitsilis et al., 2018). Table 2 shows the macro average F1-scores for the various models developed. These models do not perform well compared to the CNN and SVM. We used 75% of the instances for training and 25% for testing the accuracy of the models. The Logistic Regression model using count vectorization predicts more offensive tweets correctly than the previous models specified here. However, since the model could not classify non-offensive tweets effectively, its performance is low. The reason is that the bag-of-words model using count vectorization does not take the context or the semantics of the tweet into account, and hence contextually non-offensive tweets with an offensive word in them are misclassified. Table 3 shows the confusion matrix for the logistic regression model.
Table 2: Comparison of F1-scores of the models used
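The macro average F1-score used for the comparison can be computed as in the following sketch, which also contrasts it with accuracy on an imbalanced toy example. The labels and predictions are made up for illustration and are not results from our experiments.

```python
def f1_for_class(y_true, y_pred, label):
    """F1-score of a single class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=("OFF", "NOT")):
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up imbalanced example: 8 non-offensive and 2 offensive tweets,
# scored against a classifier that never predicts OFF.
y_true = ["NOT"] * 8 + ["OFF"] * 2
y_pred = ["NOT"] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

In this toy case the accuracy is 0.8 while the macro F1-score is about 0.44, because the F1-score of the OFF class is zero; macro averaging gives each class the same weight regardless of its size.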

Second Level Classification for Subtask-A
Most of the models we have described in the paper classify non-offensive tweets more correctly than offensive ones. To overcome the misclassification of offensive tweets, a second level of classification using a string comparison model is applied. Tweets predicted offensive with a probability of less than 0.70 are passed on to a string comparison model which checks for any occurrence of offensive words. As an aid, a dictionary of 1384 words is constructed from the offensive words banned by Google and a list of bad words. This two-level classification system proves to be very effective and increases the performance of the system by 5% in the macro average F1-score. The second level of classification is used to find whether the tweet is offensive or not. There is no need for a second level in subtasks B and C since they do not involve this classification.
Table 4 shows the accuracy (Acc) results for subtask A. Three models were developed and tested for the given dataset. The 2D-CNN model with the enhanced dataset has a better F1 score for both offensive and non-offensive tweets in comparison to the 2D-CNN with the given dataset alone and the 1D-CNN. The 1D-CNN has better accuracy than the other models. Since the given dataset is biased, some models classify non-offensive tweets more correctly than offensive tweets. Such a model has better accuracy than other models, but a lower F1 score.
The confusion matrices for the best model in subtasks A, B and C are shown in Figures 2, 3 and 4 respectively.
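The two-level decision for subtask A can be sketched as follows. This is one possible reading of the rule described above, not the exact implementation: it assumes that the first level returns the probability of the OFF class, that tweets at or above the 0.70 threshold keep the OFF label, and that the dictionary lookup is done on whole tokens. The dictionary entries and function names are placeholders.

```python
import re

THRESHOLD = 0.70   # probability cut-off stated in the text

# Placeholder entries (assumption); the dictionary described in the text
# contains 1384 words.
OFFENSIVE_WORDS = {"badword1", "badword2"}


def contains_offensive_word(tweet, dictionary=OFFENSIVE_WORDS):
    """Second level: check whether any token of the tweet is in the dictionary."""
    tokens = re.findall(r"[a-z0-9']+", tweet.lower())
    return any(token in dictionary for token in tokens)


def classify_two_level(tweet, prob_offensive):
    """Combine the first-level probability with the dictionary lookup.

    prob_offensive: probability of the OFF class from the first-level
    classifier (for example, the CNN softmax output).
    """
    if prob_offensive >= THRESHOLD:
        return "OFF"   # confident first-level prediction is kept
    # Low-confidence tweets fall through to string matching.
    return "OFF" if contains_offensive_word(tweet) else "NOT"


confident = classify_two_level("some tweet text", 0.91)
rescued = classify_two_level("you are a BadWord1", 0.40)
rejected = classify_two_level("have a nice day", 0.40)
```

Matching on whole tokens rather than substrings avoids flagging harmless words that merely contain a dictionary entry, at the cost of missing obfuscated spellings.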

Conclusion
We have built three models for the tasks of offensive language detection and classification in tweets. The models are 2D-CNN with Word2Vec learned embeddings, 1D-CNN with GloVe and SVM with TF-IDF. All the models use data pre-processed with NLTK, which we think is an important factor for improved accuracy. In addition, we also made use of a dictionary of offensive and banned words for a second level of classification.
The meaning of a tweet varies with an individual's perception and cannot be judged by simple conventional models. This is one reason for the reduced precision of classification. The concepts of irony, sarcasm, humor and other tones of a conversation are too intuitive and implicit for the models to detect them accurately. We intend to investigate further by adding multiple hidden layers and building a more complex network structure which will, in parallel, look for the tell-tale signs of the target tone of the tweets. The 1D-CNN model achieved a lower F1-score in target identification (subtask C) than in subtasks A and B, due to the smaller dataset; this can be improved by augmenting the dataset.