NLP@UIOWA at SemEval-2019 Task 6: Classifying the Crass using Multi-windowed CNNs

This paper proposes a system for OffensEval (SemEval 2019 Task 6), which calls for a system to classify offensive language into several categories. Our system is a text based CNN, which learns only from the provided training data. Our system achieves 80 - 90% accuracy for the binary classification problems (offensive vs not offensive and targeted vs untargeted) and 63% accuracy for trinary classification (group vs individual vs other).


Introduction
Background. Social media (e.g. Twitter, Facebook) is widely used today. For example, 68% of all Americans report owning a Facebook account in 2018 (Smith and Anderson, 2018), while 71% of Americans (within the ages of 18-24) report using Twitter. Online gaming is a second popular use of the internet, reaching upwards of 80 -100 million monthly users depending on game (Goslin, 2018). These uses demonstrate the internet as a way for humans to connect with others. However, connecting with others can carry a downside. More than 1 in 3 young people have been cyberbullied online (cyb, 2018), this extends to around half of teens. Besides cyberbullying, offensive language, on a public forum, can cause users to stay away from certain platforms. Because of this, companies have increased their efforts to remove offensive language from their platforms (Terdiman, 2018). With the large amount of traffic these platforms see, a purely manual approach to detecting/removing offensive language is impossible, which means an automated approach is needed to help. OffensEval (Zampieri et al., 2019b) provides a community driven opportunity to build such systems. We approach this problem of classifying offensive tweets with a CNN architecture trained on the provided training dataset.

Proposed Approach
Our system is a variation of a Convolutional Neural Network (CNN), which was chosen since it has seen success previously with classification tasks CNN Infrastructure. We experimented with two different Convolutional Neural Networks. The first whose architecture is based on the CNN originally proposed in (Kim, 2014) (CNN 1), and the second a combination of multiple CNNs (CNN 2). We further discuss each of these CNNs and their comparison on the training data. The visual structure for CNN 1 and CNN 2 can be found in figure 1 and figure 2 respectively. Preprocessing. Both CNN 1 and CNN 2 begin by preprocessing the text of the tweet. As noted in section 3, URLs and user mentions are already denoted as URL and USER. Basic cleaning of the text is applied, includes removal of punctuation, converting text to lowercase, and filtering of stopwords via NLTK's stopword list 1 . Finally, all separated words are tokenized via nltk's tokenize() function 2 . Embedding Layer. CNN 1 and CNN 2 encode the text of a tweet as a word embedding with dimension j. We experimented with several j arriving at j = 100. Word embeddings for Non-Out of Vocabulary (OOV) words are obtained from Glove (Pennington et al., 2014) which has been trained on Twitter data 3 . Experiments were also conducted with Glove common crawl data, but no visible improvement was found. OOV words are randomly initialized as a word embedding. The embedding layer takes in i word embeddings of length j, where the i word embeddings are combined in the same sequential order as they appear in the tweet. We choose i as the length of the Embedding Layer "@USER you are a lying corrupt traitor!!! Nobody wants to hear anymore of your lies!!!" Input (Tweet) longest tweet (i.e. number of words after preprocessing). Any tweets less than i length are padded with zero embeddings at the end. CNN 1 and CNN 2 differ at this point and will be examined separately.
CNN 1 Convolutional Layer. CNN 1 applies three k × j convolutional windows to the embedding layer: a 3 x j, a 4 x j, and a 5 x j window. Applying each window to the embedding layer results in a (i − k + 1) × 1 output, where k = {3,4,5} and corresponds to the length of the window. 100 filters of each window are applied to the embedding layer resulting in 100 (i − k + 1) × 1 outputs for each k.
CNN 1 Max pooling/Merge Layer A max pooling of size (i − k + 1) × 1 is applied to each separate filter output from the convolutional layer. The resulting outputs from all three max pooling streams are merged together then flattened to a 300 neuron layer.
CNN 1 Dense Layer/Output Layer The flattened layer is fed into a dense layer consisting of 128 neurons. ReLu is chosen as the activation function. Finally, the output of the dense layer is passed to the output layer of size n where n = number of classes. The output layer uses a softmax function as activation.
CNN 2 Convolutional Layer. CNN 2 applies three sets of three convolution windows to the embedding layer, each window in the format k × j.
CNN 2 Max pooling/Merge Layer. Max pooling, in this instance, behaves similarly to CNN 1 . However, instead of merging all the pooled layers, only those in the same set are merged. This results in three separate flattened 300 neuron layers.
CNN 2 Separate Dense layers The three separate merged layers are fed through two dense layers consisting of 128 neurons each. ReLu activation function is used for all dense layers. Each merged layer is fed to their own respective dense layers. The second dense layers are finally merged together.
CNN 2 Final Dense/Output Layers The merged layer is fed through two more 128 dense layers with ReLu activation. Finally, the result is Hyperparameters/Training We experimented with different epochs and batch sizes and ended up finding epochs=30 and batch size = 50 worked best for our models. The only data trained on was the training data provided. More on this data in section 3. The system was implemented with Keras 4 and Tensorflow as the backend.

Dataset
Training Set. The data collection methods used to compile the dataset provided in OffensEval is described in Zampieri et al. (2019a). The training data set provided consists of 13,240 tweets. Each tweet, consists of up to three classifications, which correspond to subtasks further described in section 4. The classifications are as follows: vi. GRP -The target of the offensive post is a group of people considered as a unity due to the same ethnicity, gender or sexual orientation, political affiliation, religious belief, or something else.

Subtasks
OffensEval divided the overall task of identifying/classifying offensive language into three subtasks, subtask A, B, and C. Subtask A. Subtask A requires a system to classify tweets as either offensive (OFF) or not offensive (NOT). An example of a tweet marked as OFF (in provided training): @USER you are a lying corrupt traitor!!! Nobody wants to hear anymore of your lies!!!.
An example of a tweet marked as NOT: @USER Buy more icecream!!!. A more expanded look at the training data can be found in section 3. Subtask A Results. As our system only trained on the provided gold standard, this data set was used to gauge the effectiveness of our two systems. Five fold cross validation was used for predicting training data. The results for subtask A on training data can be found in table 2. CNN 1 and CNN 2 achieve similar results, an accuracy of 0.7468 and 0.7555, and a macro F1 score of 0.7130 and 0.7114, respectively. The results for our systems' performance on OffensEval test data subtask A can be found in table 3. As with the training data, CNN 1 and CNN 2 perform similarly on this task, with CNN 1 achieving 0.8 accuracy and a macro F1 score of 0.73. Subtask B. Subtask B requires further classification of OFF tagged into two categories, targeted (TIN) and untargeted (UNT). An example of untargeted tweet is:   @USER @USER My favourite part of this is watching all the conservatives lose their minds as usual. Once again the Democrats a re being mean to us boo-hoo. LOL.
An example of targeted: @USER You are a complete knob! It's ppl like you who are messing up this country. More details on data in section 3. Subtask B Results. The results for cross validation on subtask B's training data are found in table 4. CNN 2 achieves a greater accuracy over CNN 1 on this subtask, 0.8723 compared to 0.8222, but still achieves a smaller macro F1 score of 0.5673 compared to 0.5827. Subtask B test results for our systems are found in table 5. Similar to training, CNN 2 outperforms CNN 1 in accuracy, 0.8958 to 0.8750, but achieves a similar trend in macro F1 scores, 0.6511 and 0.6528. Subtask C. Subtask C requires further classification of those tweets tagged are targeted (TIN), into three classes, Individual (IND), Group (GRP), and Other (OTH). Examples: IND tweet -@USER You are a complete knob! It's ppl like you who are messing up this country GRP tweet -@USER Assuming liberals are unarmed would be a grave mistake by the deplorables.

System
Acc.
Pr. Re.    OTH tweet -@USER Shooting in USA is so common no one is talking about gun control any more.
More details on subtask C data in section 3.
Subtask C Results. The results for subtask C's cross validation on training data can be found in table 6. CNN 1 slightly outperforms CNN 2 in this task, achieving an accuracy of 0.6925 over 0.6772 and a macro F1 score of 0.5282 over 0.5140. Subtask C's test data results can be found in table 7. As subtask C is a three class problem (compared to the two class problem of A, B), the accuracy and macro F1 scores are lower overall. As with training, CNN 1 slightly outperforms CNN 2 in both accuracy, 0.6291 to 0.6197, and macro F1 score, 0.5061 to 0.4939.

Discussion
Systems outperform single labels. On test data, for all three subtasks, our system outperforms the baseline, provided by organizers, for assigning a    single label for all tweets. Outperforming the best baseline (labeling tweets with highest frequent label) in terms of F1 by 0.32 in subtask A, 0.18 in subtask B, and 0.29 in subtask C. Outperformances in accuracy are seen in all three subtasks as well.
Results follow data distribution. The test confusion matrices for the higher scoring system (CNN 1), for subtask A, B, and C, can be found in

Related Work
Offensive language detection and classification has become increasingly relevant in recent years with the rise of social media. Subsequently, researchers have also begun to look at aggression, cyberbullying, hate speech, and abusive language identification.
Cyberbullying. Cyberbullying detection has been approach by several teams. Dinakar et al. (2011) show that binary classifiers for individual labels outperforms multi-label classifiers. Xu et al. (2012) demonstrate that social media is a rich environment for studying cyberbulling with NLP. Dadvar et al. (2013) show the effectiveness of including context around a comment. Abusive Language. Abusive language has also seen increase in study. Nobata et al. (2016) construct a machine learning algorithm and test with different lexical features, outperforming at the time state-of-the-art methods. Mubarak et al. (2017) expand abusive language identification to Arabic social media. Fišer et al. (2017) propose a legal framework, dataset and anotation schema for abusive online language in Slovene. Su et al. (2017) propose a system which can not only detect, but also rephrase abusive language in Chinese. Waseem et al. (2017) propose breaking abusive language identification into further subtasks. Founta et al. (2018) leverage crowd sourcing to produce a large (80,000) annotated data set of abusive Twitter language. Hate speech. Hate speech identification has come to the forefront for research as it deals with current hot button issues (e.g. racism, sexism). Schmidt and Wiegand (2017) and Fortuna and Nunes (2018) compile a surveys of current hate speech detection. Other teams have brought to question how we view and handle hate speech. Ross et al. (2016) show the difficulty of annotating hate speech and propose handling classification as non-binary. Malmasi and Zampieri (2017) establish lexical baselines for hate speech detection by applying supervised classification methods and Malmasi and Zampieri (2018) show the problems which can arise when distinguishing profanity from hate speech. ElSherief et al. (2018) further look to understand hate speech by looking into the target of hate speech (i.e. at a individual or more general group).
Machine learning classifiers are leveraged in this field as well. Kwok and Wang (2013) employ a machine learning classifier to identify racist tweets. Burnap and Williams (2015) test a machine learning system on different n-gram features to identify hate speech on Twitter. Tulkens et al. (2016) use hate speech dictionaries along with support vector machines to identify racism on Dutch social media. Schofield and Davidson (2017) demonstrate three standard methods for producing features for text classification, targeting specifically the problem of automatic hate speech identification.
Subsequently, deep learning is seen in hate speech detection as well. Djuric et al. (2015) train and leverage comment embeddings to help identify hate speech. Gambäck and Sikdar (2017) show the effectiveness of convolutional neural networks (CNN) when identifying hate speech. Zhang et al. (2018) identify hate speech using a convolution-GRU based deep neural network.
Offensive Language. Offensive language identification aims to broaden the scope of negative language identification. Wiegand et al. (2018) proposed and ran a GermEval task similar to OffensEval, which had participants classify offensive language as offensive or other, then further classify the offensive tagged language.