Joint Modelling of Emotion and Abusive Language Detection

The rise of online communication platforms has been accompanied by some undesirable effects, such as the proliferation of aggressive and abusive behaviour online. Aiming to tackle this problem, the natural language processing (NLP) community has experimented with a range of techniques for abuse detection. While achieving substantial success, these methods have so far only focused on modelling the linguistic properties of the comments and the online communities of users, disregarding the emotional state of the users and how this might affect their language. The latter is, however, inextricably linked to abusive behaviour. In this paper, we present the first joint model of emotion and abusive language detection, experimenting in a multi-task learning framework that allows one task to inform the other. Our results demonstrate that incorporating affective features leads to significant improvements in abuse detection performance across datasets.


Introduction
Aggressive and abusive behaviour online can lead to severe psychological consequences for its victims (Munro, 2011). This stresses the need for automated techniques for abusive language detection, a problem that has recently gained a great deal of interest in the natural language processing community. The term abuse refers collectively to all forms of expression that vilify or offend an individual or a group, including racism, sexism, personal attacks, harassment, cyber-bullying, and many others. Much of the recent research has focused on detecting explicit abuse, which comes in the form of expletives, derogatory words or threats, with substantial success (Mishra et al., 2019b). However, abuse can also be expressed in more implicit and subtle ways, for instance, through the use of ambiguous terms and figurative language, which has proved more challenging to identify.
The NLP community has experimented with a range of techniques for abuse detection, such as recurrent and convolutional neural networks (Pavlopoulos et al., 2017; Park and Fung, 2017; Wang, 2018), character-based models (Nobata et al., 2016) and graph-based learning methods (Mishra et al., 2018a; Aglionby et al., 2019; Mishra et al., 2019a), obtaining promising results. However, all of the existing approaches have focused on modelling the linguistic properties of the comments or the meta-data about the users. On the other hand, abusive language and behaviour are also inextricably linked to the emotional and psychological state of the speaker (Patrick, 1901), which is reflected in the affective characteristics of their language (Mabry, 1974). In this paper, we propose to model these two phenomena jointly and present the first abusive language detection method that incorporates affective features via a multitask learning (MTL) paradigm.
MTL (Caruana, 1997) allows two or more tasks to be learned jointly, thus sharing information and features between the tasks. In this paper, our main focus is on abuse detection; hence we refer to it as the primary task, while the task used to provide additional knowledge, emotion detection, is referred to as the auxiliary task. We propose an MTL framework where a single model can be trained to perform emotion detection and identify abuse at the same time. We expect that affective features, which result from a joint learning setup through shared parameters, will encompass the emotional content of a comment that is likely to be predictive of potential abuse.
We propose and evaluate different MTL architectures. We first experiment with hard parameter sharing, where the same encoder is shared between the tasks. We then introduce two variants of the MTL model to relax the hard sharing constraint and further facilitate positive transfer. Our results demonstrate that the MTL models significantly outperform single-task learning (STL) in two different abuse detection datasets. This confirms our hypothesis of the importance of affective features for abuse detection. Furthermore, we compare the performance of MTL to a transfer learning baseline and demonstrate that MTL provides significant improvements over transfer learning.

Related Work
Techniques for abuse detection have gone through several stages of development, starting with extensive manual feature engineering and then turning to deep learning. Early approaches experimented with lexicon-based features (Gitari et al., 2015), bag-of-words (BOW) or n-gram features (Sood et al., 2012; Dinakar et al., 2011), and user-specific features, such as age (Dadvar et al., 2013) and gender (Waseem and Hovy, 2016).
With the advent of deep learning, the trend shifted, with abundant work focusing on neural architectures for abuse detection. In particular, the use of convolutional neural networks (CNNs) for detecting abuse has shown promising results (Park and Fung, 2017; Wang, 2018). This can be attributed to the fact that CNNs are well suited to extract local and position-invariant features (Yin et al., 2017). Character-level features have also been shown to be beneficial in tackling the issue of out-of-vocabulary (OOV) words (Mishra et al., 2018b), since abusive comments tend to contain obfuscated words. Recently, approaches to abuse detection have moved towards more complex models that utilize auxiliary knowledge in addition to the abuse-annotated data. For instance, Mishra et al. (2018a, 2019a) used community-based author information as features in their classifiers with promising results. Founta et al. (2019) used transfer learning to fine-tune features from the author metadata network to improve abuse detection.
MTL, introduced by Caruana (1997), has proven successful in many NLP problems, as illustrated in the MTL survey of Zhang and Yang (2017). It is interesting to note that many of these problems are domain-independent tasks, such as part-of-speech tagging, chunking, named entity recognition, etc. (Collobert and Weston, 2008). These tasks are not restricted to a particular dataset or domain, i.e., any text data can be annotated for the phenomena involved. In contrast, tasks such as abuse detection are domain-specific and restricted to a handful of datasets (typically focusing on online communication), therefore presenting a different challenge to MTL.
Much research on emotion detection casts the problem in a categorical framework, identifying specific classes of emotions, e.g., using Ekman's model of six emotions (Ekman, 1992), namely anger, disgust, fear, happiness, sadness, and surprise. Other approaches adopt the Valence-Arousal-Dominance (VAD) model of emotion (Mehrabian, 1996), which represents polarity, degree of excitement, and degree of control, each taking a value from a range. The community has experimented with a variety of computational techniques for emotion detection, including vector space modelling (Danisman and Alpkocak, 2008), machine learning classifiers (Perikos and Hatzilygeroudis, 2016) and deep learning methods (Zhang et al., 2018). In their work, Zhang et al. (2018) take an MTL approach to emotion detection. However, all the tasks they consider are emotion-related (annotated for either classification or emotion distribution prediction), and the results show improvements over single-task baselines. Akhtar et al. (2018) use a multitask ensemble architecture to learn emotion, sentiment, and intensity prediction jointly and show that these tasks benefit each other, leading to improvements in performance. To the best of our knowledge, there has not yet been an approach investigating emotion in the context of abuse detection.

Datasets
The tasks in an MTL framework should be related in order to obtain positive transfer. MTL models are sensitive to differences in the domain and distribution of data (Pan and Yang, 2009). This affects the stability of training, which may deteriorate performance in comparison to an STL model (Zhang and Yang, 2017). We experiment with abuse and emotion detection datasets 1 that are from the same data domain, Twitter. All of the datasets were subjected to the same pre-processing steps, namely lower-casing, mapping all mentions and URLs to a common token (i.e., MTN and URL) and mapping hashtags to words.
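The pre-processing steps above can be sketched as follows. The token names MTN and URL come from the text; the regular expressions and the hashtag-splitting heuristic are illustrative assumptions, not the authors' exact pipeline:

```python
import re

def preprocess(tweet: str) -> str:
    """Lower-case a tweet, map mentions and URLs to common tokens,
    and map hashtags to plain words (illustrative sketch)."""
    t = tweet.lower()
    t = re.sub(r"https?://\S+", "URL", t)  # URLs -> common token
    t = re.sub(r"@\w+", "MTN", t)          # mentions -> common token
    t = re.sub(r"#(\w+)", r"\1", t)        # hashtags -> words
    return t
```

For example, `preprocess("Check https://t.co/abc @User #Happy")` yields `"check URL MTN happy"`.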

Abuse detection task
To ensure that the results are generalizable, we experiment with two different abuse detection datasets.
OffensEval 2019 (OffensEval) This dataset is from SemEval-2019 Task 6: OffensEval 2019, Identifying and Categorizing Offensive Language in Social Media (Zampieri et al., 2019a,b). We focus on Subtask A, which involves offensive language identification. It contains 13,240 annotated tweets, and each tweet is classified as to whether it is offensive (33%) or not (67%). Those classified as offensive contain offensive language or targeted offense, which includes insults, threats, profane language and swear words. The dataset was annotated using crowdsourcing, with gold labels assigned based on the agreement of three annotators.
Waseem and Hovy 2016 (Waseem&Hovy) This dataset was compiled by Waseem and Hovy (2016) by searching for commonly used slurs and expletives related to religious, sexual, gender and ethnic minorities. The tweets were then annotated with one of three classes: racism, sexism or neither. The annotations were subsequently checked through an expert review, which yielded an inter-annotator agreement of κ = 0.84. The dataset contains 16,907 TweetIDs and their corresponding annotations, out of which only 16,202 TweetIDs were retrieved due to users having been reported or tweets having been taken down since the dataset was first published in 2016. The distribution of classes is: 1,939 (12%) racism; 3,148 (19.4%) sexism; and 11,115 (68.6%) neither, which is comparable to the original distribution (11.7% : 20.0% : 68.3%).
It should be noted that racial or cultural biases may arise from annotating data using crowdsourcing, as pointed out by Sap et al. (2019). The performance of the model depends on the data used for training, which in turn depends on the quality of the annotations and the experience level of the annotators. However, the aim of our work is to investigate the relationship between emotion and abuse detection, which is likely to be independent of the biases that may exist in the annotations.

Emotion detection task
Emotion (SemEval18) This dataset is from SemEval-2018 Task 1: Affect in Tweets (Mohammad et al., 2018), and specifically from Subtask 5, a multilabel classification task over 11 emotion labels that best represent the mental state of the author of a tweet. The dataset consists of around 11k tweets (training set: 6839; development set: 887; test set: 3260). It contains the TweetID and 11 emotion labels (anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, trust), each taking a binary value to indicate the presence or absence of the emotion. The annotations were obtained from at least 7 annotators per tweet and aggregated based on their agreement.

Approach
In this section, we describe our baseline models and then proceed by describing our proposed models for jointly learning to detect emotion and abuse.

Single-Task Learning
As our baselines, we use different Single-Task Learning (STL) models that utilize abuse detection as the sole optimization objective. The STL experiments are conducted for each primary-task dataset separately. Each STL model takes as input a sequence of words {w_1, w_2, ..., w_n}, which are initialized with k-dimensional vectors e from a pre-trained embedding space. We experiment with two different architecture variants:

Max Pooling and MLP classifier We refer to this baseline as STL_maxpool+MLP. In this baseline, a two-layered bidirectional Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997) is applied to the embedding representations e of words in a post to get contextualized word representations {h_1, h_2, ..., h_n}:

{h_1, h_2, ..., h_n} = BiLSTM({e_1, e_2, ..., e_n}), h_i ∈ R^(2·l),

where l is the hidden dimensionality of the BiLSTM. We then apply a max pooling operation over {h_1, h_2, ..., h_n}:

r^(p) = maxpool({h_1, h_2, ..., h_n}),

where r^(p) ∈ R^(2·l) and the superscript (p) is used to indicate that the representations correspond to the primary task. This is followed by dropout (Srivastava et al., 2014) for regularization and a 2-layered Multi-layer Perceptron (MLP) (Hinton, 1987):

m^(p) = MLP(r^(p)),

where W_{l1} and W_{l2} are the weight matrices of the 2-layer MLP. Dropout is applied to the output m^(p) of the MLP, which is then followed by a linear output layer to get the unnormalized output o^(p). For OffensEval, a sigmoid activation σ is then applied in order to make a binary prediction with respect to whether a post is offensive or not, while the network parameters are optimized to minimize the binary cross-entropy (BCE):

L_BCE = -(1/N) Σ_{i=1}^{N} [y_i log p(y_i) + (1 - y_i) log(1 - p(y_i))],

where N is the number of training examples, and y denotes the true and p(y) the predicted label. For Waseem&Hovy, a log softmax activation is applied for multiclass classification, while the network parameters are optimized to minimize the categorical cross-entropy, that is, the negative log-likelihood (NLL) of the true labels:

L_NLL = -(1/N) Σ_{i=1}^{N} log p(y_i).

BiLSTM and Attention classifier We refer to this model as STL_BiLSTM+attn.
In this baseline (Figure 1; enclosed in the dotted boxes), rather than applying max pooling, we apply dropout to h, which is then followed by a third BiLSTM layer and an attention mechanism:

r^(p) = BiLSTM(h), m^(p) = Attention(r^(p)),

where r^(p) is the output of the third BiLSTM. We then apply dropout to the output of the attention layer m^(p). The remaining components, output layer and activation, are the same as in the STL_maxpool+MLP model.
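The attention step can be sketched in plain Python. This is a generic dot-product attention with a context vector u; the paper does not specify its exact scoring function, so the formulation below is an assumption for illustration:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(hidden_states, u):
    """Weight each hidden state by its (softmaxed) dot product with a
    context vector u, then return the weighted sum: the post vector m."""
    scores = [sum(ui * hi for ui, hi in zip(u, h)) for h in hidden_states]
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    return [sum(a * h[j] for a, h in zip(alphas, hidden_states))
            for j in range(dim)]
```

With a context vector aligned to the first dimension, `attend([[10, 0], [0, 0]], [1, 0])` concentrates nearly all attention mass on the first state.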
Across the two STL baselines, we further experiment with two different input representations: 1) GloVe (G), where the input is projected through the GloVe embedding layer (Pennington et al., 2014); 2) GloVe+ELMo (G+E), where the input is first projected through the GloVe embedding layer and the ELMo embedding layer (Peters et al., 2018) separately, and then the final word representation e is obtained by concatenating the output of these two layers. Given these input representations, we have a total of 4 different baseline models for abuse detection. We use grid search to tune the hyperparameters of the baselines on the development sets of the primary task (i.e., abuse detection).
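The max pooling operation and the BCE objective of the STL_maxpool+MLP baseline above can be sketched in plain Python; the toy hidden states are placeholder values, and the BiLSTM itself is omitted:

```python
import math

def max_pool(hidden_states):
    """Element-wise max over the time dimension: one vector per post."""
    return [max(h[j] for h in hidden_states)
            for j in range(len(hidden_states[0]))]

def bce(y_true, y_pred):
    """Binary cross-entropy averaged over N examples."""
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / n

# toy contextualized representations h_1..h_3 (dimension 4)
h = [[0.1, -0.2, 0.5, 0.0],
     [0.3,  0.1, -0.4, 0.2],
     [-0.1, 0.4, 0.2, 0.1]]
r = max_pool(h)  # -> [0.3, 0.4, 0.5, 0.2]
```

The pooled vector r keeps the strongest activation of each feature across all positions, which is what makes max pooling position-invariant.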

Multi-task Learning
Our MTL approach uses two different optimization objectives: one for abuse detection and another for emotion detection. The two objectives are weighted by a hyperparameter β [(1 − β) for abuse detection and β for emotion detection] that controls the importance we place on each task. We experiment with different STL architectures for the auxiliary task and propose MTL models that contain two network branches (one for the primary task and one for the auxiliary task) connected by a shared encoder, which is updated by both tasks alternately.
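The weighting scheme amounts to a single scalar combination of the two task losses; the loss values below are stand-ins for the objectives defined in this section:

```python
def combined_loss(loss_primary, loss_auxiliary, beta=0.1):
    """Weight the two objectives: (1 - beta) for abuse detection and
    beta for emotion detection (beta = 0.1 per the paper's tuning)."""
    return (1 - beta) * loss_primary + beta * loss_auxiliary
```

With β = 0.1, a primary loss of 1.0 and an auxiliary loss of 2.0 combine to 0.9 + 0.2 = 1.1, so the auxiliary gradient signal is scaled down by a factor of 9 relative to the primary one.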
Hard Sharing Model This model architecture, referred to as MTL_Hard, is inspired by Caruana (1997) and uses hard parameter sharing: it consists of a single encoder that is shared and updated by both tasks, followed by task-specific branches. Figure 1 presents MTL_Hard, where the dotted box represents the STL_BiLSTM+attn architecture that is specific to the abuse detection task. In the right-hand side branch, corresponding to the auxiliary objective of detecting emotion, we apply dropout to h before passing it to a third BiLSTM. This is then followed by an attention mechanism to obtain m^(a), to which dropout is applied. The superscript (a) is used to indicate that these representations correspond to the auxiliary task. Then, we obtain the unnormalized output o^(a) after passing m^(a) through a linear output layer, with o^(a) ∈ R^11 (11 different emotions in SemEval18), which is then subjected to a sigmoid activation to obtain a prediction p(y). While the primary task on the left is optimized using either the BCE or the NLL objective (depending on the dataset used), the auxiliary task is optimized to minimize binary cross-entropy.

Double Encoder Model This model architecture, referred to as MTL_DEncoder, is an extension of the previous model that now has two BiLSTM encoders: a task-specific two-layered BiLSTM encoder for the primary task, and a shared two-layered BiLSTM encoder. During each training step of the primary task, the input representation e for the primary task is passed through both encoders, which results in two contextualized word representations, h^(p) and h^(s). These are summed (Figure 2, where both α^(p) and α^(s) are fixed and set to 1) and the output representation is passed through a third BiLSTM followed by an attention mechanism to get the post representation m^(p). The rest of the components of the primary task branch, as well as the auxiliary task branch, are the same as those in MTL_Hard.
Gated Double Encoder Model This model architecture, referred to as MTL_GatedDEncoder, is an extension of MTL_DEncoder, but differs in the way we obtain the post representation m^(p). Representations h^(p) and h^(s) are now merged using two learnable parameters α^(p) and α^(s) (where α^(p) + α^(s) = 1.0) to control the flow of information from the representations that result from the two encoders (Figure 2):

h = α^(p) · h^(p) + α^(s) · h^(s).

The remaining architecture components of the primary task and auxiliary task branches are the same as for MTL_DEncoder.
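The gated merge is a convex combination of the two encoder outputs. In the model, α^(p) and α^(s) are learnable parameters; in this sketch they are plain arguments:

```python
def gated_merge(h_primary, h_shared, alpha_p=0.9):
    """Merge the task-specific and shared representations with
    alpha_p + alpha_s = 1 (both learnable in the actual model)."""
    alpha_s = 1.0 - alpha_p
    return [alpha_p * hp + alpha_s * hs
            for hp, hs in zip(h_primary, h_shared)]
```

Setting alpha_p = 1 recovers a purely task-specific representation, while alpha_p = 0.5 weights both encoders equally; the learned gate lets the model find the mix that best serves the primary task.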

Experimental setup
Hyperparameters We use pre-trained GloVe embeddings 2 with dimensionality 300 and pre-trained ELMo embeddings 3 with dimensionality 1024. Grid search is performed to determine the optimal hyperparameters. We find an optimal value of β = 0.1, which makes the updates for the auxiliary task 10 times less important than those for the primary task. The encoders consist of 2 stacked BiLSTMs with hidden size 512. For all primary task datasets, the BiLSTM+Attention classifier and the 2-layered MLP classifier have hidden size 256. For the auxiliary task dataset, the BiLSTM+Attention classifier and the 2-layered MLP classifier have hidden size 512. Dropout is set to 0.2. We use the Adam optimizer (Kingma and Ba, 2014) for all experiments. All model weights are initialized using Xavier initialization (Glorot and Bengio, 2010). For MTL_GatedDEncoder, α^(p) = 0.9 and α^(s) = 0.1.

Training All models are trained until convergence for both the primary and the auxiliary task, and early stopping is applied based on the performance on the validation set. For MTL, we ensure that both the primary and the auxiliary task have completed at least 5 epochs of training. The MTL training process involves randomly (with p = 0.5) alternating between the abuse detection and emotion detection training steps. Each task has its own loss function, and in each task's training step the model is optimized accordingly. All experiments are run using stratified 10-fold cross-validation, and we use the paired t-test for significance testing. We evaluate the models using Precision (P), Recall (R), and F1, and report the average macro scores across the 10 folds.
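The alternating training schedule described above can be sketched as follows; the returned task names are stand-ins for the per-task optimization steps, which are hypothetical here:

```python
import random

def mtl_training_schedule(n_steps, p=0.5, seed=0):
    """Randomly alternate between the primary (abuse) and auxiliary
    (emotion) training steps with probability p, as in the MTL setup."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_steps):
        if rng.random() < p:
            schedule.append("abuse")    # primary-task training step
        else:
            schedule.append("emotion")  # auxiliary-task training step
    return schedule
```

Each entry of the schedule corresponds to one gradient update on the matching task's own loss, so over many steps both branches (and the shared encoder) receive roughly equal numbers of updates when p = 0.5.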

STL experiments
The STL experiments are conducted on the abuse detection datasets independently. As mentioned in the STL section, we experiment with four different model configurations to select the best STL baseline. Table 1a presents the evaluation results of the STL models trained and tested on the OffensEval dataset, and Table 1b on the Waseem and Hovy dataset. The best results are highlighted in bold and are in line with the validation set results. We select the best performing STL model configuration on each dataset and use it as part of the corresponding MTL architecture in the MTL experiments below.

MTL experiments
In this section, we examine the effectiveness of the MTL models for the abuse detection task and explore the impact of using emotion detection as an auxiliary task. We also compare the performance of our MTL models with that of a transfer learning approach.
Emotion detection as an auxiliary task In this experiment, we test whether incorporating emotion detection as an auxiliary task improves the performance of abuse detection. Tables 2a and 2b show the results on the OffensEval and Waseem&Hovy datasets († indicates statistically significant results over the corresponding STL model). Learning emotion and abuse detection jointly proved beneficial, with the Gated Double Encoder Model MTL_GatedDEncoder achieving statistically significant improvements in F1 (p < 0.05, using a paired t-test). This suggests that affective features from the shared encoder benefit the abuse detection task.
MTL vs. transfer learning Transfer learning is an alternative to MTL that also allows us to transfer knowledge from one task to another. This experiment aims to compare the effectiveness of MTL against transfer learning. We selected the MTL model with the best performance in abuse detection and compared it against an identical model trained in a transfer learning setting. In this setup, we first train the model on the emotion detection task until convergence and then proceed by fine-tuning it for the abuse detection task. Table 3 presents the comparison between MTL and transfer learning, for which we use the same architecture and hyperparameter configuration as MTL. We observe that MTL outperforms transfer learning, with statistically significant improvements (p < 0.05) on both the OffensEval and Waseem&Hovy datasets.

Discussion
Auxiliary task Our results show that emotion detection significantly improves abuse detection on both the OffensEval and Waseem&Hovy datasets. Table 4 presents examples from both datasets in which the MTL_GatedDEncoder model improves over the STL model. In the examples, the highlighted words are emotion-evocative words, which are also found in the SemEval18 Emotion dataset. As the emotion detection task encourages the model to learn to predict the emotion labels for the examples that contain these words, the word representations and encoder weights learned by the model encompass some affective knowledge. Ultimately, this allows the MTL model to determine the affective nature of an example, which may help it classify abuse more accurately. It is also interesting to observe that a controversial person or topic may strongly influence the classification of a sample containing it. For example, sentences referring to certain politicians may be classified as Offensive regardless of the context; an instance of this can be found in Table 4. The MTL model, however, classifies it correctly, which may be attributed to the excessive use of "!" marks. The exclamation mark is one of the most frequently used symbols in the SemEval18 Emotion dataset and can accompany many emotions, such as surprise and fear; it is therefore not indicative of a particular type of emotion. Such knowledge can be learned within the shared features of the MTL model.

MTL vs. transfer learning This experiment demonstrates that MTL achieves higher performance than transfer learning in a similar experimental setting. The higher performance may be indicative of a more stable way of transferring knowledge, which leads to better generalization.
In the MTL framework, since the shared parameters are updated alternately, each task learns knowledge that may be mutually beneficial to both related tasks, leading to a shared representation that encompasses the knowledge of both tasks and is hence more generalized. In contrast, in the case of transfer learning, the primary task fine-tunes the knowledge from the auxiliary task (i.e., in the form of pre-trained parameters) for its own task objective and may forget auxiliary-task knowledge in the process.

Conclusion
In this paper, we proposed a new approach to abuse detection, which takes advantage of affective features to gain auxiliary knowledge through an MTL framework. Our experiments demonstrate that MTL with emotion detection is beneficial for the abuse detection task in the Twitter domain. The mutually beneficial relationship between these two tasks opens new research avenues for improving abuse detection systems in other domains as well, where emotion would equally play a role. Overall, our results also suggest the superiority of MTL over STL for abuse detection. With this new approach, one can build more complex models introducing new auxiliary tasks for abuse detection. For instance, we expect that abuse detection may also benefit from joint learning with complex semantic tasks, such as figurative language processing and inference.