Attending the Emotions to Detect Online Abusive Language

In recent years, abusive behavior has become a serious issue in online social networks. In this paper, we present a new corpus from a semi-anonymous social media platform, which contains instances of the offensive and neutral classes. We introduce a single deep neural architecture that considers both local and sequential information from the text in order to detect abusive language. Along with this model, we introduce a new attention mechanism called emotion-aware attention. This mechanism uses the emotions behind the text to find the most important words within that text. We experiment with this model on our dataset and then present an analysis of the results. Additionally, we evaluate our proposed method on different corpora and show new state-of-the-art results with respect to offensive language detection.


Introduction
Nowadays, abusive behavior has become a serious issue in online societies (Jones et al., 2013; Ybarra and Mitchell, 2004). Unfortunately, such behavior can have serious effects on the physical, mental and social health of the younger generations (see https://enough.org/stats_cyberbullying). In recent years, there have been several efforts to automate the detection of offensive language across social media platforms. Traditional approaches using lexical features have been shown to work quite well for this task (Dinakar et al., 2012; Davidson et al., 2017); however, these types of features can bias the system towards profane words, whereas reports show that most profanities are used in a neutral way in today's teen talk (Samghabadi et al., 2017). The following examples indicate that profane words are no longer a good criterion for filtering abusive content:
Neutral: Damn you are such a BEAUTIFUL F*CKING MOMMY!
Offensive: stop sending questions to yourself pretending you're someone else you weirdo!
Most of the resources available for this task have been created based on either bad words or seed words related to abusive topics, and do not cover implicit forms of abusive language. In this paper, we propose a method to create a new corpus that does not focus on bad words and still has a reasonable number of offensive instances. We collect our data from Curious Cat, a semi-anonymous question-answering website that has increased in popularity among teenagers and youth. The anonymity option available on Curious Cat opens the door for digital abuse. On this website, users can choose not to reveal any personal information on their account, and can post comments/questions on other users' timelines anonymously. Due to these properties, we are limited in both the content of a post (since most posts are very short) and the information about the sender of that post.
To overcome the aforementioned challenges within the data, we propose a single deep neural architecture that employs only the textual cues from the input text to decide whether it is offensive or not. Along with this model, we introduce the Emotion-Aware Attention (EA) mechanism, which dynamically learns to weigh the words based on the emotions behind the text. We use this method to reduce the model's bias towards bad words. Our main contributions in this paper are as follows:
• We create a new corpus for the task of offensive language detection, which is not biased towards profane words.
• We propose a neural architecture that captures both local and sequential information from the text to predict whether it is offensive or not. Along with this model, we introduce a new attention mechanism that incorporates emotional information from the text when computing attention weights, in order to find the most important words for the task of offensive language identification. We show that our stacked CNN-BiLSTM model with EA outperforms several strong baselines and also the state of the art across multiple corpora.
• We show the effectiveness of our proposed attention model over regular attention by visualizing the attention weights. We also analyze the mistakes of the model to find whether there is room for improvement.

Related Work
Abusive language identification and hate speech detection have been addressed by many research papers. Most of the related work has employed feature engineering approaches, using a combination of different types of lexical, syntactic, semantic, sentiment and lexicon-based features along with classic machine learning algorithms such as Support Vector Machines (SVM) and Logistic Regression (Schmidt and Wiegand, 2017; Gitari et al., 2015; Van Hee et al., 2015; Davidson et al., 2017; Nobata et al., 2016; Wiegand et al., 2018). Due to the popularity of deep neural networks, multiple studies have recently explored the performance of these models on the task of aggression identification. Most of these studies focus on hate speech detection on Twitter.
Gambäck and Sikdar (2017) use a Convolutional Neural Network (CNN) based model and investigate different textual and embedding features as the input to the model, where word2vec produces the best results. Badjatiya et al. (2017) conduct an extensive evaluation of multiple traditional and deep learning approaches, and report the best results using an ensemble of LSTM and Gradient Boosted Decision Trees. Mishra et al. (2019) model the network of users with a Graph Convolutional Network (GCN) to learn the structure of online communities along with the linguistic behaviors of the users within them. They report state-of-the-art results on the Waseem and Hovy (2016) Twitter data by passing the hidden representation produced by the GCN to a Logistic Regression classifier.
Our model differs from other existing methods in two major ways: (1) we do not have access to user-level information due to the nature of the data we work on, and (2) instead of using an ensemble approach, we propose a single deep neural architecture that shows very promising results across multiple resources.

Available Resources
In this paper, we make use of the following available resources to evaluate our method:
• ask.fm (Samghabadi et al., 2017): Created based on the most frequent profane words in ask.fm, this corpus contains around 6K question-answer pairs, where each question and each answer is labeled as positive/neutral or negative.
• Wikipedia personal attacks dataset (Wulczyn et al., 2017): Includes over 115k labeled discussion comments from the English Wikipedia. This dataset was annotated by Crowdflower annotators, where each label shows whether a comment contains a personal attack.
• Kaggle insult dataset: Released in 2012 for the shared task "Detecting Insults in Social Commentary" hosted by Kaggle, it contains around 6K posts on adult topics like politics, military, etc.

New Resource
We collected our own data from Curious Cat, which is a fairly new semi-anonymous question-answering social media platform, like ask.fm. Curious Cat has been steadily increasing in popularity among teens and pre-teens, and has more than 12 million registered users. This site allows users to post anonymously on other users' timelines.
We have crawled around 500K English question-answer pairs from 2K randomly chosen users in Curious Cat. Regarding the annotation process, to avoid bias in our data, we decided not to focus on profanities. So, instead of using either a dictionary of bad words or seed words related to bullying traces to find potentially offensive messages, we chose to apply a pre-trained classifier with reasonable performance on the other resources. Since the format of the Curious Cat data is similar to ask.fm, we decided to use the state-of-the-art classification method on ask.fm (Samghabadi et al., 2017). We trained the classifier on the full ask.fm dataset and applied it to Curious Cat in order to automatically label all rows of data. Although ask.fm and Curious Cat have the same format, we noticed key differences between these two sources of data, which may substantially affect the quality of automatic labeling. For example, the ask.fm data was created based on profane words, so we expected the classifier trained on this data to be sensitive to some bad words. However, in Curious Cat, we observed numerous sexual posts that are full of bad words, yet not offensive to the user. In fact, some users encourage others to continue posting sexual comments, as in the following example:
Question: I wanna s*ck your d*ck so hard and taste your c*m.
Answer: Enter my DMs beautiful.
Therefore, we created the primary version of our data by randomly selecting 2,482 question-answer pairs, where 60% were chosen from the negative labeled data and 40% from the positive labeled data (we only considered the label of the questions). Then, using a two-way annotation scheme, we asked four annotators to annotate each row of the data to finalize the labels. Table 1 shows the final distribution of the proposed corpus. The average inter-annotator agreement kappa score is 0.499, which shows moderate agreement among the annotators. It is also interesting to see that 95% of negative comments were posted on users' timelines anonymously. Table 2 compares the four different resources that we use in this paper. We will make our dataset available to the public when anonymity is no longer a concern.
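As an illustration of how an agreement score like the one reported above can be computed, the following is a minimal sketch of Cohen's kappa for a single pair of annotators. It assumes that the reported value is an average of pairwise Cohen's kappa scores, which the text does not state explicitly; the function name and the toy labels are hypothetical and are not taken from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy labels (hypothetical, not from the Curious Cat corpus).
annotator_1 = ["neg", "neg", "neu", "neu", "neg", "neu", "neu", "neu"]
annotator_2 = ["neg", "neu", "neu", "neu", "neg", "neu", "neg", "neu"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

With more than two annotators per item, a multi-rater statistic such as Fleiss' kappa would be the usual alternative.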

Methodology
Our proposed model to detect offensive language contains three different modules. The first one uses a Convolutional Neural Network (CNN) to learn the text representation based on local information. CNNs can extract lexical features, which have been proven to benefit the task of aggression identification. The second module extracts the sequential information from the text via a Bidirectional Long Short-Term Memory (BiLSTM), and uses our proposed Emotion-Aware Attention (EA) mechanism to measure the importance of each word based on the emotions that the text conveys. Essentially, this module extracts the contextual information from the text. The last module is a sequential layer that is applied on top of the output representations from the first and the second modules. It aggregates the lexical and contextual information in order to decide whether the input text is offensive or not. Figure 1 shows the overall architecture of the proposed model.

Embedding Layer
The input layer of our model is an embedding layer, which takes a sequence of words, extracts the embedding vector for each word, and generates the corresponding embedding matrix for the given input text. We use 200-dimensional GloVe embeddings pre-trained on Twitter. We also experimented with ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) contextualized embeddings, but the results were not as competitive as those obtained with GloVe embeddings.

Convolutional Neural Network (CNN)
Several studies on abusive language identification and hate speech detection show promising results with the use of word-level CNN models (Aroyehun and Gelbukh, 2018; Zhang et al., 2018; Zhang and Luo, 2018). Our CNN model takes a sequence of word embedding vectors as input. It contains four stacked 1-dimensional convolution units with 100 filters and different filter sizes of 2, 3, 4 and 5 to extract word n-gram features (Kim, 2014). For each convolution unit, we use ReLU as the activation function, and apply a max pooling operation to take the maximum value as the output feature. Then, we concatenate the outputs of all convolution units and feed the resulting representation to a dropout layer to avoid overfitting.
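The feature extraction described above can be sketched in plain Python. This is only an illustration under stated assumptions: the four convolution units are read as parallel branches over the same embeddings whose pooled outputs are concatenated (in the style of Kim, 2014), the weights are random, the sizes are toy values rather than the 200-dimensional embeddings and 100 filters of the paper, and dropout is omitted. All names are hypothetical.

```python
import random


def conv_relu_maxpool(embeddings, filters, biases, width):
    """Apply 1-D filters of one width, ReLU, then max-over-time pooling."""
    n_tokens = len(embeddings)
    features = []
    for weights, bias in zip(filters, biases):
        best = 0.0  # ReLU outputs are >= 0, so 0 is a safe floor for the max
        for start in range(n_tokens - width + 1):
            # Flatten the window of `width` consecutive word vectors.
            window = [x for vec in embeddings[start:start + width] for x in vec]
            activation = sum(w * x for w, x in zip(weights, window)) + bias
            best = max(best, max(0.0, activation))
        features.append(best)
    return features


def cnn_features(embeddings, params):
    """Concatenate the pooled features of every filter width."""
    out = []
    for width, (filters, biases) in sorted(params.items()):
        out.extend(conv_relu_maxpool(embeddings, filters, biases, width))
    return out


rng = random.Random(0)
EMB_DIM, N_FILTERS = 4, 3  # toy sizes, much smaller than in the paper
params = {
    width: (
        [[rng.uniform(-1, 1) for _ in range(width * EMB_DIM)]
         for _ in range(N_FILTERS)],
        [0.0] * N_FILTERS,
    )
    for width in (2, 3, 4, 5)
}
sentence = [[rng.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(7)]
features = cnn_features(sentence, params)
print(len(features))  # one pooled value per filter and per width
```

Each filter contributes a single value per text, so the length of the concatenated representation depends only on the number of filters and filter sizes, not on the length of the post.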

Emotion-Aware Attention Model (EA)
In addition to the CNN model, we also pass the input embedding vectors to a Bidirectional LSTM (BiLSTM) layer to model the sequential information of the text. Another input to this module is a single emotion vector that is extracted from the input text. To capture the emotion in the text, we use the DeepMoji model (Felbo et al., 2017) pre-trained on Twitter data. This model creates a representation over 64 frequently used online emojis that shows how relevant each emoji is to a given text. We also experimented with the NRC emotion lexicon (Mohammad and Turney, 2013), but it does not seem to work well with short online texts because of two major limitations: (1) for short texts, the generated vectors are too sparse since the lexicon does not cover many words, and (2) several words are assigned to more than one category, which is confusing in the case of short texts, since there is not enough context to decide which emotion is dominant. In contrast, emojis provide us with fine-grained emotional categories that make this decision easier.
To prepare the emoji vectors for our model, given a text, we split it into sentences (if it contains more than one sentence). Then, we extract the DeepMoji vector for each sentence and calculate the average vector per post. Finally, we build a binary representation that assigns 1 to the five most probable emojis, and 0 to the others. We pass this emoji vector through a non-linear layer to project it into the same space as the output of the BiLSTM model. Then, both the word and emotion representations of the text are fed to the attention model.
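The averaging and binarization steps described above can be sketched as follows. The per-sentence scores are assumed to come from an emoji prediction model such as DeepMoji, which is not called here; the helper name, the tie-breaking rule, and the toy 8-dimensional scores are assumptions made for illustration (the actual vectors have 64 entries and the paper keeps the top five).

```python
def binary_emoji_vector(sentence_vectors, top_k=5):
    """Average per-sentence emoji scores and mark the top_k emojis with 1."""
    if not sentence_vectors:
        raise ValueError("need at least one sentence vector")
    dim = len(sentence_vectors[0])
    n = len(sentence_vectors)
    # Average vector per post over all of its sentences.
    mean = [sum(vec[j] for vec in sentence_vectors) / n for j in range(dim)]
    # Indices of the top_k highest averages (ties go to the lower index).
    top = sorted(range(dim), key=lambda j: (-mean[j], j))[:top_k]
    binary = [0] * dim
    for j in top:
        binary[j] = 1
    return binary


# Toy scores for a two-sentence post (hypothetical values).
post = [
    [0.30, 0.05, 0.20, 0.10, 0.05, 0.15, 0.10, 0.05],
    [0.10, 0.05, 0.40, 0.05, 0.20, 0.05, 0.10, 0.05],
]
emoji_vec = binary_emoji_vector(post, top_k=3)
print(emoji_vec)
```

The binary form discards the exact scores and keeps only which emojis dominate the post.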
The motivation for the EA mechanism is the model presented in Maharjan et al. (2018). In that paper, the authors proposed a genre-aware attention model that uses genre information to find the most appropriate set of features for predicting the likability of a particular book. Similarly, we hypothesize that it is not enough to focus only on the word representations in the attention model, for two reasons: (1) many bad words are also used in a neutral way to make jokes and pay compliments among friends, and (2) some texts do not contain any profanity, but are still offensive to the receiver. Both cases may mislead the model into predicting the wrong label. Therefore, we design the EA mechanism to work based not only on the word representations, but also on the emotions behind the whole text, in order to better distinguish the most important words of a document.
Let $h_i$ be the concatenation of the forward and backward hidden states of the BiLSTM for the $i$-th word, and let $e$ be the emoji vector. To measure the importance of words, we calculate the attention weights $\alpha_i$ as follows:
$$\alpha_i = \frac{\exp(\mathrm{score}(h_i, e))}{\sum_{i'} \exp(\mathrm{score}(h_{i'}, e))}$$
where the score(.) function is defined as:
$$\mathrm{score}(h_i, e) = v^T \tanh(W_a h_i + W_e e + b)$$
where $W_a$ and $W_e$ are weight matrices, and $b$ and $v$ are the parameters of the model. $W_e$ is shared across the words and adds emotion effects to the attention weights. The output of the attention layer is the weighted sum $r$ calculated as follows:
$$r = \sum_i \alpha_i h_i$$
Finally, we concatenate the output of the attention model with the input emoji vector to further consider the direct effect of the emotions on the model.
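To make the mechanism concrete, here is a minimal pure-Python sketch of emotion-aware attention. It assumes an additive score of the form v^T tanh(W_a h_i + W_e e + b), that is, the regular attention score used as a baseline in this paper extended with an emotion term, followed by a softmax over word positions. The parameter values and dimensions are toy values chosen for illustration, not learned weights, and the function names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def emotion_aware_attention(hidden_states, emotion, W_a, W_e, b, v):
    """Return the attention weights and the weighted sum of hidden states."""
    # The emotion term is identical for every word, so compute it once.
    emotion_term = matvec(W_e, emotion)
    scores = []
    for h in hidden_states:
        word_term = matvec(W_a, h)
        pre = [math.tanh(w + e + bias)
               for w, e, bias in zip(word_term, emotion_term, b)]
        scores.append(sum(vi * p for vi, p in zip(v, pre)))
    # Softmax over word positions, shifted by the max for stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [x / total for x in exps]
    # Weighted sum of the hidden states.
    dim = len(hidden_states[0])
    r = [sum(a * h[j] for a, h in zip(alphas, hidden_states))
         for j in range(dim)]
    return alphas, r


# Toy example: three words, 2-d hidden states, 2-d emotion vector.
hidden = [[0.5, -0.2], [1.0, 0.3], [-0.4, 0.8]]
emotion = [1.0, 0.0]
W_a = [[0.2, -0.1], [0.4, 0.3]]
W_e = [[0.5, 0.0], [-0.3, 0.2]]
b = [0.0, 0.1]
v = [1.0, -1.0]
alphas, r = emotion_aware_attention(hidden, emotion, W_a, W_e, b, v)
print(alphas, r)
```

Because the emotion term sits inside the tanh, the same word can receive a different weight when the emotion vector changes, which a score computed from the word representations alone cannot do.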

Sequential Layer
We concatenate the document representations produced by the two above mentioned modules.The resulting vector is then fed into a hidden dense layer with 100 neurons.To improve generalization of the model, we use batch normalization and dropout with a rate of 0.5 after the hidden layer.Finally, we use a two neuron output layer along with softmax activation to predict whether the input text is offensive or not.
Experiments and Results

Preprocessing and Experimental Setup
For the Curious Cat dataset, we split the data into train and test sets using a stratified 70:30 training-to-test ratio, and use 20% of the training data as the validation set. For the other available corpora, we use the same train, validation, and test folds as used by Samghabadi et al. (2017). As for the preprocessing step, we lowercase the texts and replace all links and user mentions with the words "url" and "@username", respectively. We also truncate the posts to 200 tokens, and left-pad shorter sequences with zeros.
We use Binary Cross Entropy to compute the loss between predicted and actual labels, and train the network using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-5. We train the model for 150 epochs, and report the test results based on the best macro F1 obtained on the validation set.
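The preprocessing steps above can be sketched as follows. The regular expressions, the whitespace tokenization, the choice to keep the first 200 tokens when truncating, and the toy vocabulary with an `<unk>` entry are all assumptions made for illustration; the text does not specify these details.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # assumed link pattern
MENTION_RE = re.compile(r"@\w+")               # assumed mention pattern
MAX_LEN = 200
PAD_ID = 0


def normalize(text):
    """Lowercase the text and replace links and user mentions."""
    text = text.lower()
    text = URL_RE.sub("url", text)
    text = MENTION_RE.sub("@username", text)
    return text


def encode(text, vocab, max_len=MAX_LEN):
    """Map tokens to ids, truncate to max_len, and left-pad with zeros."""
    tokens = normalize(text).split()[:max_len]
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    return [PAD_ID] * (max_len - len(ids)) + ids


# Hypothetical toy vocabulary; id 0 is reserved for padding.
vocab = {"<unk>": 1, "hey": 2, "url": 3}
clean = normalize("Hey @Bob check https://example.com NOW")
ids = encode("Hey @Bob check https://example.com NOW", vocab, max_len=6)
print(clean)
print(ids)
```

Left-padding places the actual tokens at the end of the sequence, so they are the last inputs seen by the forward direction of a recurrent layer.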

Baseline
We compare our model against the state of the art, as well as several strong baselines listed below:
• Emoji Baseline: We use the emoji vectors as the input to this model and directly pass them to the sequential and output layers.
• CNN: We feed the output of the CNN module to the sequential and output layers to predict the labels. We also consider concatenating the emoji vectors with the CNN output.
• BiLSTM + Regular Attention (RA): Similar to our main model, we use the same BiLSTM module, but calculate the score(.) function without considering emoji vectors, using the following formula: $v^T \tanh(W_a h_i + b_a)$. We then feed the resulting representation to the sequential and output layers.
• BiLSTM + EA: We use exactly the same BiLSTM module as in our main architecture and pass the results to the sequential and output layers.
• CNN-BiLSTM + RA: This model is similar to our proposed model, but uses the RA mechanism instead of EA.
• Sam'17: This model is the state of the art for the ask.fm corpus and is presented in Samghabadi et al. (2017). It uses a combination of several textual features as the input to an SVM classifier.
• Mishra'18: This model is presented in Mishra et al. (2018) and reported state-of-the-art results for the Wikipedia dataset. It learns a context-aware representation for characters by concatenating the one-hot character vectors within a document. The resulting representations are then fed to a BiLSTM module with tanh activation, and passed through the output layer.

Results
For the evaluation, we use the F1 score for the negative/offensive class, since this is the class of interest. We also report the macro F1 score, which calculates the average performance over both classes. This is to ensure that the model does not sacrifice the positive/neutral class to increase the performance of the negative class.
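As a small worked example of the two metrics, the sketch below computes the F1 score of one class and the macro F1 over both classes from lists of gold and predicted labels. The label encoding (1 for offensive, 0 for neutral) and the toy predictions are hypothetical.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score of a single class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Toy predictions (hypothetical): 1 = offensive, 0 = neutral.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
f1_offensive = f1_for_class(y_true, y_pred, 1)
f1_macro = macro_f1(y_true, y_pred, [0, 1])
print(round(f1_offensive, 3), round(f1_macro, 3))
```

In this toy case the offensive-class F1 is 2/3 and the neutral-class F1 is 0.8, so the macro F1 is their unweighted mean.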
Table 3 shows the classification results for all of the resources, where we obtain the best results with our stacked CNN-BiLSTM model for the Curious Cat, ask.fm and Kaggle data. Overall, compared to Samghabadi et al. (2017), all of the improvements with our model are statistically significant under the McNemar significance test (p-value < 0.001) for all of the resources. Our model outperforms the state of the art for the ask.fm dataset by about 1.5% with respect to macro F1. For the Kaggle corpus, we also compare our results against the winner of the Kaggle competition, which has an AUC (Area Under the ROC Curve) score of 0.842. The AUC score obtained on this dataset with our model is 0.913, which is an improvement of about 7 percentage points. Only for the Kaggle dataset is the performance of the EA model slightly worse than that of RA. We believe that this is due to the diverse nature of the data in the different datasets. We return to this point in the Error Analysis section. For the Wikipedia data, we observe that the CNN module obtains the best results (around 1.5% better than the state-of-the-art F1 score); however, the difference (0.1%) between the performance of this model and our proposed stacked CNN-BiLSTM is not significant based on the McNemar test.
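For readers unfamiliar with the McNemar test, the following sketch shows one common variant, the exact two-sided binomial test on the discordant pairs of two classifiers evaluated on the same test items. The text does not say which variant was used, so this is an assumption, and the labels and predictions below are toy values rather than results from the paper.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on the discordant pairs.

    Returns (b, c, p_value), where b counts items that only model A
    classifies correctly and c counts items that only model B does.
    """
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(a == t and p != t for t, a, p in triples)
    c = sum(a != t and p == t for t, a, p in triples)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the two models never disagree in correctness
    k = min(b, c)
    # Under the null hypothesis, each discordant pair favors either
    # model with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (hypothetical): twelve items that are all offensive (label 1).
y_true = [1] * 12
pred_a = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
pred_b = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p_value)
```

Only the items on which exactly one of the two models is correct enter the test; items that both models get right or both get wrong carry no information about which model is better.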
Based on Table 3, the CNN module performs better than the BiLSTM module across all of the resources, which indicates that lexical information plays a significant role in offensive language identification. We can also observe that concatenating emoji vectors with the final hidden representation clearly boosts the performance of the system, which supports our assumption that using the emotional information behind the text benefits the model.

Why Do the Emoji Vectors Help?
Figure 2 shows the emoji distribution over the neutral and offensive classes for the Curious Cat training data. To create this plot, we use the average DeepMoji vector extracted for each instance, which shows the relevance of each emoji to a specific comment. We create the overall emoji vector per class by averaging the emoji vectors extracted for all of the instances of the same class. Finally, we choose 19 out of the 64 emojis used in the DeepMoji project to create the plot shown in Figure 2. The fact that different patterns are visible for the neutral and offensive classes supports our hypothesis that it is useful to incorporate emoji information into the model.
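The per-class averaging behind such a plot can be sketched as follows. The vectors are toy 3-dimensional stand-ins for the 64-dimensional per-instance emoji vectors, and the function name and class names are hypothetical.

```python
def class_average(vectors, labels):
    """Average the emoji vectors of all instances that share a label."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, x in enumerate(vec):
            acc[j] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


# Toy per-instance emoji vectors (hypothetical values).
vectors = [[0.6, 0.2, 0.2], [0.4, 0.4, 0.2], [0.1, 0.3, 0.6]]
labels = ["offensive", "offensive", "neutral"]
averages = class_average(vectors, labels)
print(averages)
```

Plotting the two resulting vectors side by side, one bar per emoji, gives a per-class comparison of the kind described above.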
Based on Figure 2, angry emojis ([emoji], [emoji], [emoji]) are highly correlated with the offensive class, whereas happy and love faces ([emoji], [emoji], [emoji]) appear more frequently in the neutral class. For the happy and love faces [emoji] and [emoji], the difference between the offensive and neutral classes is much smaller. We believe that this represents the scenarios where a defender (a user who defends the victim of online attacks) tries to support an attacked user by complimenting him/her, while expressing hatred towards the attackers. Sad faces ([emoji], [emoji], [emoji], [emoji], [emoji]) are more frequent in neutral instances, which may reflect the cases where a user expresses his/her unhappiness in response to an attack. Interestingly, the laughing face, [emoji], shows a higher probability for the negative class. This can be linked to the scenario where someone attempts to bully a user by humiliating him/her. Additionally, the plot shows exactly the same probabilities for the poker face ([emoji]) over the offensive and neutral classes, so we can conclude that this emoji does not convey any additional information related to offensive language. Other emojis ([emoji], [emoji], [emoji], and [emoji]) also seem to appear frequently in the offensive class, indicating violent and threatening behavior towards the user.

Attention Visualization
Looking at Ex1, Ex2 and Ex3 in Table 4, we can see that our model captures the different uses of the word f*cking. For Ex1, the angry faces are top-rated for the comment. Although there is no other profane word in the sentence, the model seems to correctly focus on the phrase attention seeker. In contrast, for Ex2, [emoji], [emoji] and [emoji] are listed as the most probable emojis, and the model also weighs the negation word don't and the positive adjective pretty. Ex3 shows an instance where the word f*cking stands for sexual activities. Top-rated emojis for this comment include [emoji], [emoji] and [emoji], which indicate sexually playful language. We can see that the attention model correctly focuses on the other sexual-related words as well.
Ex4 and Ex5 focus on the word die. The first example is obviously offensive towards the receiver. Again, we can see that the angry faces plus [emoji] are extracted as the dominant emojis for the text. In this case, I hope is also highlighted, which seems to trigger the emotions involved in the comment. However, in Ex5, the model also attends to the words Please and dont, which change the emotional direction of the comment to the [emoji] and [emoji] and [emoji] emojis. This instance also illustrates that our EA mechanism is able to capture negation in the text.
The word ugly is sometimes used in offensive comments. In our data, we look for examples that contain this word as the single profane word in order to check whether our model can distinguish between offensive and neutral uses of it. Ex6 shows a very offensive comment, even though it does not contain any intense bad word. It is interesting that our model attends to the word black, but not to the negation words (not, does not). The top emojis extracted from this text include angry, disgusted and poker faces. Ex7 illustrates the case where the user confirms what the harasser already posted in the hope of preventing further attacks. Unlike the previous example, the dominant emojis for this comment are sad faces along with [emoji]. It seems that the model captures the self-targeted offensive language by giving a high weight to the word me as well as ugly.
With respect to the word annoying, EA also attends to the words/phrases you (Ex8) and white person (Ex9), which enables the model to decide whether the comment targets a specific person or not. Top-rated emojis for Ex8 include the angry faces, but for Ex9 emojis like [emoji], [emoji] and [emoji] are dominant.
Via Ex10 and Ex11, we can observe that our model is able to distinguish the posts where an intense bad word like sl*t is used to offend someone ([emoji], [emoji], [emoji]) from those where it is used in a sexually playful conversation ([emoji], [emoji]). It is interesting that in Ex11 the weight assigned to the word my is greater than the weights of the second person pronouns.
Ex12 displays an instance where there is no bad word in the text, but it is still offensive to the receiver. This example is even more challenging since the comment includes the positive word thanks. The attention heat map shows that the top emojis like [emoji], [emoji] and [emoji] help the model to learn the negative load of the word deactivate.
Figure 3 shows the emoji distribution for the 15 most frequent emojis over the correctly and incorrectly classified instances of the Curious Cat test set with our best model. To create this plot, we calculate the average emoji vector per class for both categories, and we only consider the probabilities for the top five emojis per instance. Based on the two subplots, there is an inverse pattern for some emojis ([emoji], [emoji], [emoji], [emoji], [emoji], [emoji], [emoji], and [emoji]) across the neutral and offensive classes for the correctly and incorrectly classified instances. This may account for most of the classifier's mistakes. We believe that this could be an error that propagated from the DeepMoji model to ours. Besides, comparing the range of the probabilities between the two subplots, the probability scores for the emojis assigned to incorrect predictions are much lower than those for the correct ones. This shows that the DeepMoji model is not confident about the dominant emojis assigned to the misclassified examples.

Error Analysis
Based on the final predictions of our best model on the Curious Cat test data, we find that the model is confused when the comment is a question, particularly in cases where the user did not put a question mark at the end of the sentence (e.g. without sounding like an ignorant dumba*s, what is pansexuality). We observe several such instances in our data that were labeled as neutral by the annotators, even though there are profane words in the comment. Therefore, it seems that phrasing a comment as a question can change the tone of the language from offensive to neutral. Another source of error is humiliating posts that are very short and do not contain profanity (e.g. Fix your teeth). We believe that to detect this kind of offensive instance, we also need to consider the answer that the user provides to the received comment.
On the other hand, we are also interested in investigating the reasons behind the superiority of the RA model over EA on the Kaggle data. Looking at the instances mislabeled by EA that are labeled correctly by RA, we find that our EA model is not able to correctly label general questions like "Why you gotta trash Cali?" (actual label = neutral, predicted label = negative). It is also unable to detect insults that are indirectly addressed to the user (e.g. It must suck to be so stupid. mindless , though-controlled libturd sheeple), and some offensive slang (e.g. Back under your rock), since the DeepMoji model assigns irrelevant emojis to them.

Conclusion and Future Work
In this paper, we propose a stacked CNN-BiLSTM model with an emotion-aware attention mechanism as a new architecture to detect online abusive behavior and hateful language. We make use of DeepMoji vectors to extract the emotion behind the text, and show through the analysis section how much it benefits the performance of the model. Using our proposed model, we outperform the state-of-the-art results and several strong baselines across the three existing corpora. We also create a new resource for the task of detecting offensive language that does not focus on bad words. Our model shows very promising results on this dataset as well.
As for future work, since the perceived level of aggression is very subjective to the user, we plan to jointly model the question and answer within a pair for the Curious Cat and ask.fm data. We believe that the reply that the user provides in response to a received question/comment is a strong indicator of whether it was offensive or neutral towards the user. Another possible path to move the research forward is to expand this task to the detection of cyberbullying episodes, which have become a growing concern in online societies.

Figure 1: Overall architecture of the model

Figure 2: Emoji distribution over Curious Cat data
Figure 3: Average probability for top emojis extracted from correctly and incorrectly predicted instances of Curious Cat data

Table 1: Curious Cat data distribution

Table 2: Data comparison

Table 3: Classification results in terms of F1 score for the negative/offensive class and macro F1. The results of our proposed model are significantly better than Sam'17 under the McNemar significance test.

Table 4: Attention visualization for the challenging examples in the Curious Cat data that are correctly classified with our model
Table 4 shows the attention visualization for some challenging examples in the Curious Cat test data that are labeled correctly by our best model. We specifically study the instances that are very short in length. The first three rows of the table contain examples of the word f*cking in different contexts. This word is used in two different ways: (1) to express anger, annoyance, contempt, or surprise, or (2) to refer to sexual activities.