Self-Contextualized Attention for Abusive Language Identification

Attention mechanisms have become popular in deep learning approaches to natural language processing due to their outstanding performance. These mechanisms make it possible to weight the elements of a sequence according to their context. However, this importance has so far been modeled separately, either between the pairs of elements of a sequence (self-attention) or with respect to the application domain of a sequence (contextual attention), which leads to the loss of relevant information and limits the representation of the sequences. To tackle these issues we propose the self-contextualized attention mechanism, which addresses both limitations by considering the internal and the contextual relationships between the elements of a sequence. The proposed mechanism was evaluated on four standard collections for the abusive language identification task, achieving encouraging results: it outperformed the current attention mechanisms and showed competitive performance with respect to state-of-the-art approaches.


Introduction
The integration of social media platforms into the everyday lives of billions of users has increased the number of online social interactions, promoting the exchange of different opinions and points of view that would otherwise be ignored by traditional media. The use of these social media platforms has revolutionized the way people communicate and share information. Unfortunately, not all of these interactions are constructive, as the presence of Abusive Language (AL) has spread to these media.
AL is characterized by the presence of insults, teasing, criticism and intimidation (Cecillon et al., 2019). Mainly, it includes epithets directed at an individual's characteristics, which are personally offensive, degrading and insulting. Because of its negative social impact (Kumar et al., 2018), the automatic identification of AL has attracted the interest of social media companies and governments (Hinduja and Patchin, 2010). As a result, multiple efforts have been made to combat the proliferation of AL, ranging from codes of conduct, norms and regulations for content publication on social media, to the use of Natural Language Processing (NLP) for the computational analysis of language (Schmidt and Wiegand, 2017).
Among the efforts made by the NLP community, one of the most relevant issues in the AL identification task is to distinguish between the use of profane words and vulgarities in offensive and non-offensive texts. The importance and interpretation of each word is therefore highly context dependent, which is one of the reasons why traditional bag-of-words methods tend to generate many false positives in their predictions. Few works related to this task have explored the importance of words according to their context; in particular, Deep Learning (DL) approaches with the addition of the Attention Mechanism (AM) have been explored as an alternative to solve this issue (Pavlopoulos et al., 2017; Chakrabarty et al., 2019). The idea behind the AM is to provide the classification model with the ability to focus on a subset of inputs (or features), thereby weighting words according to their context. Due to their outstanding performance in many NLP tasks, several AMs have been proposed in recent years (Chaudhari et al., 2020), which can be divided into two main approaches: Self-Attention (SA) (Vaswani et al., 2017) and Contextual Attention (CA) (Yang et al., 2016) mechanisms. Specifically, SA models the relationships among words within the same sentence, whereas CA selectively focuses on words with respect to an external query vector, which is adjusted according to the training task. The more important a word is in determining the answer to that query, the more focus it is given.
Despite their outstanding performance, both approaches have their own limitations. On the one hand, CA ignores the internal relationships between the words of a sequence; on the other hand, SA does not consider the global relationships between the words of different sequences, which causes the loss of relevant information about the application domain (training task). Clearly, the limitations of these AMs are complementary, and a hybrid AM could overcome the individual issues. In this work we extend the use of the AM by proposing the Self-Contextualized Attention (SCA) mechanism, an AM that addresses the previous limitations by taking advantage of both the SA and CA mechanisms. The proposed SCA mechanism is designed to be applied to any sequence of word encoding features; nevertheless, due to the high context dependency of words in this specific task, in this work we focus exclusively on the AL identification task.
The main contributions of this paper are as follows. After identifying a Deep Neural Network (DNN) architecture that is rather stable and well-performing, we propose the SCA mechanism and integrate it into that architecture. We then conduct a quantitative and qualitative study of the effectiveness of our proposed AM against SA, CA and some other novel approaches to the AL identification task. To the best of our knowledge, this is the first effort to combine both AM variants. The paper is organized as follows. In Section 2, we present previous works related to the AL identification task, along with other hybrid AM approaches. In Section 3, we describe our proposed SCA mechanism, as well as the classification framework employed. In Section 4, we present the datasets used to evaluate our SCA mechanism, the implementation details, and the external resources fed to the classification framework. Section 5 reports and discusses our quantitative and qualitative results. Finally, Section 6 summarizes our findings and discusses future work.

Related work
Considering the well-acknowledged increase of AL on social media platforms, several datasets (Zeerak and Dirk, 2016; Davidson et al., 2017; Marcos et al., 2019) and evaluation campaigns (Fersini et al., 2018; Kumar et al., 2018; Aragón et al., 2020) have been proposed in order to mitigate the impact of such messages.
The detection of AL has been mainly addressed from a supervised perspective, considering a great variety of features. Initial works used a combination of hand-crafted features such as bag-of-words representations, considering word and character n-grams (Burnap and Williams, 2016), as well as syntactic and linguistic features (Nobata et al., 2016). Aiming to improve the generalization of the classifiers, other works have explored the use of DL by taking word or character sequences from texts to learn abusive patterns without the need for explicit feature engineering; the use of word embeddings as features predominates in these works (Zhang et al., 2018; Saksesi et al., 2018; Amrutha and Bindu, 2019). More recently, there has been a trend within the NLP community towards the use of Transformers to improve text representations. In particular, for the identification of AL, transfer learning has been applied with different pre-trained models, such as ELMo, GPT-2 and BERT (Liu et al., 2019; Nikolov and Radivchev, 2019).
Regarding the classification stage, a vast range of approaches and techniques have also been proposed. These approaches can be divided into two main categories. The first category relies on traditional classification algorithms such as Naive Bayes, Support Vector Machines (SVM), Logistic Regression and Random Forest (Burnap and Williams, 2016; Nobata et al., 2016; Davidson et al., 2017; Schmidt and Wiegand, 2017). The second category includes DL approaches, which rely on Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) to accomplish the tasks of feature extraction (Badjatiya et al., 2017; Gambäck and Sikdar, 2017) and dependency learning (Badjatiya et al., 2017; Saksesi et al., 2018). In addition, the combination of both types of neural networks has been used to develop powerful structures that capture order information between the extracted features (Zhang et al., 2018; Amrutha and Bindu, 2019).
Finally, the most recent works in AL identification have considered DL architectures with the addition of an AM. One of the first works introducing attention into the task used the SA mechanism to detect abuse in portal news and Wikipedia (Pavlopoulos et al., 2017). Subsequently, (Chakrabarty et al., 2019) showed that the CA introduced by (Yang et al., 2016) improved on the results of SA in this task. Later, in (Jarquín- the use of CA was extended to the word n-gram level, showing the advantages of using word sequences when identifying AL. Regarding tasks other than AL identification, some hybrid AMs have been proposed for the combination and representation of different instances and modalities (Khullar and Arora, 2020; Zhang et al., 2020). Unlike these hybrid approaches, the proposed SCA mechanism combines the features of the SA and CA mechanisms at the instance level. Motivated by these previous works, and with the goal of creating an AM that handles both the internal and external relationships between words, in this paper we propose the SCA mechanism.

Self-contextualized attention
This section is divided into two subsections. First we introduce our proposed SCA mechanism, which is designed to be applied to any sequence of encoding features. Subsequently, we present the DNN architecture used as our classification framework. For more details related to the AMs, we refer the reader to the following work: (Chaudhari et al., 2020).

Self-contextualized attention mechanism
Given a sequence of encoding features H = {h_1, h_2, ..., h_n}, where H ∈ R^{k×n}, k is the number of encoding features and h_i refers to the i-th element of H, the purpose of our proposed SCA mechanism is to generate a global context-aware representation G that considers both the internal and external relationships between the encoding features of H. Figure 1 shows the general architecture of our proposed SCA mechanism. This architecture is divided into three major stages, each illustrated by one of the three rectangles, corresponding to the SA, CA and SCA stages. Below, we present these stages in detail.
SA stage: as in (Pavlopoulos et al., 2017), the main purpose of SA is to build connections between elements of the same sequence at different positions. SA allows the modeling of both long-range and local dependencies, which are captured by the attention filter α_s ∈ R^{n×n} defined in Equation 1. This attention filter is calculated as the dot-product similarity between all pairs of elements of H; these values are then smoothed with a softmax function. Finally, the context-aware representation S ∈ R^{k×n}, shown in Equation 2, is calculated as the matrix multiplication of H and α_s^T, where α_s is used to highlight the most relevant encoding features and filter out the least relevant ones.
CA stage: unlike the previous stage, the CA mechanism uses a context vector u_h ∈ R^k, which is randomly initialized and jointly learned during the training process. This vector is used as a query vector to obtain the attention values α_c ∈ R^n by measuring the similarity between the elements of the sequence H and the application domain represented by u_h. This similarity is calculated in Equation 3 as the dot product of u_h^T and H; the resulting values are smoothed with a softmax function. In contrast to the CA mechanism proposed by (Yang et al., 2016), instead of using a weighted sum of the encoding features for the final sequence representation, our context-aware representation C ∈ R^{k×n}, shown in Equation 4, retains all the information of the attention values by performing an element-wise multiplication ⊙ between each scalar of α_c and its corresponding encoding feature h_i.
SCA stage: the previous stages generate two different context-aware representations, S and C, respectively. The purpose of this stage is to merge these representations in order to create a global context-aware representation G ∈ R^{k×n} that integrates both the internal and external relationships. These relationships are captured by the global attention filter α_g ∈ R^{n×n}, which is calculated as the smoothed dot-product similarity between S and C, as shown in Equation 5. This attention filter can be seen as a high-level attention representation, since it is calculated from both the local dependencies and the application domain. Finally, the global context-aware representation G is calculated in Equation 6 as the matrix multiplication of H and α_g^T.
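Written out from the stage descriptions above, Equations 1 to 6 can be sketched as follows. This is a reconstruction from the prose rather than the typeset formulas, and it rests on two assumptions that the text does not state explicitly: the softmax is applied row-wise to the matrix-valued filters, and the dot-product similarity between S and C is taken as S^T C so that the result has the stated shape n×n.

```latex
\begin{align}
\alpha_s &= \operatorname{softmax}\!\left(H^{\top} H\right), & \alpha_s &\in \mathbb{R}^{n \times n} \tag{1}\\
S &= H \alpha_s^{\top}, & S &\in \mathbb{R}^{k \times n} \tag{2}\\
\alpha_c &= \operatorname{softmax}\!\left(u_h^{\top} H\right), & \alpha_c &\in \mathbb{R}^{n} \tag{3}\\
C &= \alpha_c \odot H, \quad c_i = \alpha_{c,i}\, h_i, & C &\in \mathbb{R}^{k \times n} \tag{4}\\
\alpha_g &= \operatorname{softmax}\!\left(S^{\top} C\right), & \alpha_g &\in \mathbb{R}^{n \times n} \tag{5}\\
G &= H \alpha_g^{\top}, & G &\in \mathbb{R}^{k \times n} \tag{6}
\end{align}
```

Under the row-wise assumption, column i of S (and of G) is a convex combination of the columns of H, weighted by row i of the corresponding filter.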
The proposed SCA mechanism can be applied to any sequence of encoding features H. For the purposes of this work, each element of the sequence is represented by the word encoding features h_i.
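To make the shapes of the three stages concrete, the following NumPy sketch computes S, C and G for a random H. It is a minimal illustration and not the implementation used in the experiments: the row-wise softmax axis, the use of S^T C as the similarity between S and C, and the names `softmax` and `self_contextualized_attention` are assumptions, and u_h is simply drawn at random here, whereas in the model it is learned during training.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def self_contextualized_attention(H, u_h):
    """Sketch of the SA, CA and SCA stages for H of shape (k, n) and u_h of shape (k,)."""
    # SA stage: pairwise dot-product similarities, smoothed row-wise (assumed axis).
    alpha_s = softmax(H.T @ H, axis=-1)      # (n, n)
    S = H @ alpha_s.T                        # (k, n)
    # CA stage: similarity of each element to the context vector.
    alpha_c = softmax(u_h @ H, axis=-1)      # (n,)
    C = H * alpha_c                          # (k, n); column i is scaled by alpha_c[i]
    # SCA stage: global filter built from the two context-aware representations.
    alpha_g = softmax(S.T @ C, axis=-1)      # (n, n)
    G = H @ alpha_g.T                        # (k, n)
    return G, (alpha_s, alpha_c, alpha_g)

rng = np.random.default_rng(0)
k, n = 8, 5                                  # illustrative sizes only
H = rng.normal(size=(k, n))
u_h = rng.normal(size=k)                     # random stand-in for the learned context vector
G, (alpha_s, alpha_c, alpha_g) = self_contextualized_attention(H, u_h)
print(G.shape, alpha_s.shape, alpha_c.shape, alpha_g.shape)
```

Returning the three filters alongside G keeps them available for the kind of attention visualization discussed in the qualitative analysis.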

Classification framework
In order to integrate our proposed SCA mechanism into the AL identification task, we adapt a modular and well-performing DNN architecture as our classification framework. This architecture was presented in (Yang et al., 2016; Chakrabarty et al., 2019) and is designed to manage different AMs in a modular way. The adapted architecture is shown in Figure 2; it consists of four main stages, which are described below.
The first and second stages correspond to the input and encoding stages, respectively. The input stage consists of the embedding matrix X ∈ R^{d×n}, which is a sequence of n d-dimensional word vectors x_i. The embedding matrix X is then passed as input to the encoding stage, which consists of a Bidirectional Gated Recurrent Unit (Bi-GRU) layer. The Bi-GRU layer accomplishes the sequence encoding task by summarizing the information of the whole sequence X centered around each word annotation; the encoding stage thus produces a sequence of encoding features H ∈ R^{k×n}.
Since not all words contribute equally to the meaning and representation of a sequence, the third stage corresponds to the attention stage, which includes the SCA mechanism and an average pooling layer. Specifically, the encoded features H are passed as input to the SCA mechanism, which generates a global context-aware representation G. Since the next stage requires a vector as input to the classification layers, the matrix G is reduced with the average pooling layer, generating a high-level representation vector g ∈ R^k that summarizes the most relevant information from G. Finally, the fourth stage uses the representation vector g as input for the classification layers. Two layers handle the final classification: a dense layer with a Rectified Linear Unit (ReLU) activation function, and a fully-connected softmax layer that produces the class probabilities. The implementation details and the hyperparameter settings are presented in Section 4.2.
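The attention output and classification stages described above can be sketched in a few lines of NumPy. This is only an illustration of the data flow from G to class probabilities: the layer sizes, the random weights and the function name `classify` are hypothetical, and the Bi-GRU encoder and the training procedure are omitted.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def classify(G, W1, b1, W2, b2):
    """Average-pool G (k, n) over the n positions, then apply dense ReLU and softmax layers."""
    g = G.mean(axis=1)                        # (k,) average pooling over sequence positions
    hidden = np.maximum(0.0, W1 @ g + b1)     # dense layer with ReLU activation
    return softmax(W2 @ hidden + b2)          # class probabilities

rng = np.random.default_rng(1)
k, n, hidden_units, num_classes = 8, 5, 16, 2   # illustrative sizes, not the paper's settings
G = rng.normal(size=(k, n))                     # stand-in for the SCA output
W1, b1 = rng.normal(size=(hidden_units, k)), np.zeros(hidden_units)
W2, b2 = rng.normal(size=(num_classes, hidden_units)), np.zeros(num_classes)
probs = classify(G, W1, b1, W2, b2)
print(probs)
```

Averaging over positions makes g independent of the sequence length n, which is why the classification layers can have a fixed input size k.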

Experimental settings
This section presents the experimental settings. First, we introduce the four evaluation datasets, which correspond to Twitter collections. Then, with the purpose of facilitating the replicability of our results, we present our method's implementation details, starting from the text preprocessing phase, up to the configuration of the classification framework.

Datasets for AL identification
AL can be of different types; its main divisions are distinguished by the target and severity of the insults. Accordingly, different collections and evaluation campaigns have considered different kinds of AL. Below we present a brief description of the four English datasets used in our experiments. From now on we refer to them as DS1, DS2, DS3, and DS4. DS1 (Davidson et al., 2017) and DS2 (Zeerak and Dirk, 2016) were some of the first large-scale datasets for abusive tweet detection; DS1 focuses on the identification of racist and sexist tweets, whereas DS2 focuses on identifying tweets with abusive language and hate speech. On the other hand, DS3 (Marcos et al., 2019) and DS4 (Fersini et al., 2018) were used in SemEval-2019 Task 6 and in the Evalita 2018 Task on Automatic Misogyny Identification (AMI), respectively. DS3 focuses on identifying offensive tweets, whereas DS4 focuses on identifying misogyny in tweets. Both shared tasks provide a fine-grained evaluation through different sub-tasks; in this work, we focus on sub-task A (binary classification of offenses and misogyny, respectively). Figure 3 summarizes the class distribution of the four collections.

Implementation details
Different text preprocessing operations were applied: user mentions and links were replaced by the default tokens <user> and <url>; in order to enrich the vocabulary, all hashtags were segmented into words (e.g. #BuildTheWall → build the wall) with the ekphrasis library proposed in (Baziotis et al., 2017); in addition, all emojis were converted into words (e.g. a smiley-face emoji → smiley face) using the demoji library; stop words were removed, with the exception of personal pronouns; all text was lowercased, and non-alphabetical characters as well as consecutive repeated words were removed.
For word representation we used pre-trained fastText embeddings (Mikolov et al., 2018), trained with subword information on Common Crawl, which have been recognized as useful for this task according to the study presented in (Corazza et al., 2020). Training used the Adam optimizer (Kingma and Ba, 2015) and a dropout rate of 15%.
In order to assess the robustness of our proposal, we consider four baseline architectures. The first architecture is a simple Bi-GRU network, which receives words as input but does not use any attention layers. The second and third architectures employ the same Bi-GRU network with the addition of an SA and a CA layer, respectively. Finally, in order to compare the performance of our proposed SCA mechanism against a novel AL identification approach, the fourth baseline is a fine-tuned BERT base model (12 layers, 768 hidden size, 12 attention heads per layer), built with the addition of the task-specific inputs and the end-to-end fine-tuning of all parameters. As described in (Devlin et al., 2019), we take the last-layer encoding of the classification token <CLS> and use it as input for the softmax classification layer. These four baseline architectures and our classification framework are referred to in the experiments as Bi-GRU, Bi-GRU_SA, Bi-GRU_CA, BERT_BASE, and Bi-GRU_SCA, respectively.
It is important to mention that the first three baseline architectures used the same hyperparameter settings.
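The text preprocessing steps listed above can be approximated with the Python standard library alone, as in the sketch below. Several parts are assumptions made for illustration: the CamelCase hashtag splitter is only a stand-in for the ekphrasis segmenter, emoji conversion (done with demoji in the paper) is omitted, the stop-word and pronoun lists are toy subsets, and the order in which the operations are applied is not specified in the text.

```python
import re

# Toy lists for illustration; the actual stop-word list is not given in the text.
PERSONAL_PRONOUNS = {"i", "you", "he", "she", "we", "they"}
BASE_STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "you", "i"}
STOP_WORDS = BASE_STOP_WORDS - PERSONAL_PRONOUNS   # personal pronouns are kept

def split_hashtag(match):
    """Stand-in for ekphrasis segmentation: split a CamelCase hashtag into words."""
    words = re.findall(r"[A-Z][a-z]+|[a-z]+|[A-Z]+(?![a-z])", match.group(1))
    return " ".join(words)

def preprocess(text):
    """Apply mention/link replacement, hashtag splitting, lowercasing and filtering."""
    text = re.sub(r"https?://\S+", "<url>", text)      # links -> <url>
    text = re.sub(r"@\w+", "<user>", text)             # user mentions -> <user>
    text = re.sub(r"#(\w+)", split_hashtag, text)      # hashtags -> words
    tokens = []
    for tok in text.lower().split():
        if tok not in ("<user>", "<url>"):
            tok = re.sub(r"[^a-z]", "", tok)           # drop non-alphabetical characters
        if not tok or tok in STOP_WORDS:
            continue                                   # drop empty tokens and stop words
        if tokens and tokens[-1] == tok:
            continue                                   # drop consecutive repeated words
        tokens.append(tok)
    return " ".join(tokens)

cleaned = preprocess("@bob you are are so wrong!!! #BuildTheWall https://t.co/xyz")
print(cleaned)   # <user> you are so wrong build wall <url>
```

Replacing links before mentions avoids touching any "@" that appears inside a URL.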

Experimental results
This section is organized in three subsections. Sections 5.1 and 5.2 present the quantitative results of the experimentation, corresponding to the comparison of our proposed SCA mechanism against the baselines and state-of-the-art results. Finally, Section 5.3 presents some qualitative results of the SCA mechanism, through the analysis and visualization of the attention values.
Table 2 shows the mean and standard deviation obtained in a 10-fold cross-validation evaluation of our classification framework (Bi-GRU_SCA), as well as of the four baseline architectures Bi-GRU, Bi-GRU_SA, Bi-GRU_CA and BERT_BASE. For the sake of comparison, we evaluate all the collections using the macro-average F_1 score, which is commonly used in the AL identification task.
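As a reminder of how the evaluation measure is computed, the sketch below implements the macro-average F_1 score from scratch, together with the weighted-average variant that the later comparison uses for DS1 and DS2. The function names and the toy labels are hypothetical and only serve to illustrate the definitions; they are not taken from the experiments.

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    """Return {label: F1}, with F1 = 2*TP / (2*TP + FP + FN) for each label."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by the number of true instances per class."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)

# Toy labels for illustration only (1 = abusive, 0 = non-abusive).
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
print(macro_f1(y_true, y_pred))     # 0.625
print(weighted_f1(y_true, y_pred))
```

Because the macro average gives every class the same weight, it penalizes poor performance on the minority class more strongly than the weighted average does, which matters for imbalanced collections.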

Quantitative effectiveness of the SCA mechanism
Focusing first on the first three baselines and on our classification framework (columns 2-5), the results indicate that the use of an AM outperformed the base Bi-GRU network (column 2 vs columns 3-5) by a margin of at least 1.1%. In addition, CA outperformed SA (column 4 vs column 3) by a margin of at least 1.2%, which is consistent with the results obtained in (Chakrabarty et al., 2019). Finally, comparing our proposed SCA mechanism against SA and CA (column 5 vs columns 3 and 4), better results are obtained on the four evaluation datasets, with improvements of at least 1.1%. Since the CA baseline outperforms the SA-based one, we compared Bi-GRU_SCA vs Bi-GRU_CA with the chi-squared test, obtaining statistically significant differences with p ≤ 0.001. Table 2 also compares the results of our proposed SCA mechanism with the BERT_BASE baseline (column 5 vs column 6). The Bi-GRU_SCA DNN obtained better results on 3 out of 4 datasets. In addition to these results, our Bi-GRU_SCA DNN has a considerably lower number of parameters than the BERT_BASE model (7M vs 110M), which greatly reduces the computing power necessary to run it. Finally, compared to some novel approaches for the AL identification task (Alshaalan and Al-Khalifa, 2020), our DNN improves model interpretability through the SCA mechanism.

Comparison with the state-of-the-art
In this subsection we compare our proposed DNN architecture (Bi-GRU_SCA) with state-of-the-art approaches. Since the datasets DS1 and DS2 are each presented as a single dataset, in order to have a fair comparison with other works they were partitioned into 80% for training, 10% for validation and 10% for testing; in addition, the weighted-average F_1 score was used as the evaluation measure for these datasets. In the case of the DS3 and DS4 datasets, the provided training and testing partitions were used for the evaluation. Since these datasets come from shared tasks, the evaluation measures were matched to each of them: DS3 and DS4 were evaluated using the macro-average F_1 score and the accuracy, respectively. Table 3 presents the results of our proposed Bi-GRU_SCA DNN architecture in comparison with state-of-the-art results. It shows that the Bi-GRU_SCA DNN obtained better results on 2 out of 4 datasets. It is important to note that the state-of-the-art results on the DS2 and DS3 datasets exceed our results by margins of only 1% and 0.03%, respectively. Specifically, in (Mozafari et al., 2019), which corresponds to the DS1 and DS2 state-of-the-art results, a BERT-based CNN is used for feature extraction from the transformer encoders, generating a hierarchical encoded vector that is used for AL classification.
Regarding the state-of-the-art results on the DS3 and DS4 datasets, the best-performing teams of each shared task were considered. On the one hand, NULI, the best-performing team in the DS3 shared task (Liu et al., 2019), used a BERT-base-uncased model with default parameters, a maximum sentence length of 64 and a variety of text pre-processing techniques. On the other hand, hateminers achieved the highest performance on the DS4 shared task (Saha et al., 2018), with a run based on a vector representation that concatenates sentence embeddings, TF-IDF and average word embeddings, coupled with a Logistic Regression model. Unlike the reported state-of-the-art approaches, our SCA mechanism on a simple and well-performing DNN obtains competitive results without the use of a complex DNN (Mozafari et al., 2019) or large amounts of resources and features (Saha et al., 2018).
Table 3: Comparison of results from our classification framework and state-of-the-art approaches on four datasets for AL identification (DS1 and DS2 were evaluated with the weighted-average F_1; DS3 and DS4 were evaluated using the macro-average F_1 and the accuracy, respectively).
The boxplots shown in Figure 5 compare our Bi-GRU_SCA results (red rhombus) against the top-10 results of the shared tasks SemEval 2019 Task 6 and AMI Evalita 2018, respectively. As shown in the graphs, our results are competitive with respect to the top-10 results obtained by the best participating teams in each sub-task A. In both boxplots our results remain above the third quartile; in the AMI Evalita 2018 shared task in particular, an outstanding performance is obtained with our proposed SCA mechanism in the classification framework.

Qualitative effectiveness of the SCA mechanism
NOTE: This subsection contains examples of language that may be offensive to some readers; these do not represent the perspectives of the authors.
In order to understand how our proposed SCA mechanism improves the representation of sequences, this subsection presents the qualitative results of the analysis and visualization of the attention values. Since the SCA mechanism integrates both the SA and CA mechanisms, the attention values were considered at three different levels, through the analysis of the α_s, α_c and α_g attention filters, which correspond to the SA, CA and SCA mechanisms. Figure 4 shows the attention heatmaps corresponding to the three attention filters integrated by the SCA mechanism. The example shown in the figure, "<user> who is the loser bitch fuck you <url>", is an offensive instance taken from the DS3 dataset. As shown in the figure, the values of the attention filter α_s, corresponding to SA, tend to be highest for each element itself and for its closest neighbors. For example, the most relevant words to "who" are "who" itself, followed by "is"; likewise, the most relevant words to "fuck" are "fuck", "you" and "bitch". On the other hand, the values of the attention filter α_c, corresponding to CA, indicate the most relevant words for AL identification; as can be seen in the central heatmap of Figure 4, the most relevant words are "loser", "bitch" and "fuck", which indeed correspond to words potentially used in offensive contexts.
Finally, the values of the attention filter α_g, corresponding to SCA, are shown in the right heatmap of Figure 4. The attention filter α_g shows the combination of both AMs, which improves the representation of an instance. For example, among the most relevant words to "<user>", a closer relationship to offensive words now appears, highlighting the words "loser", "bitch" and "fuck", which are often used to offend; something similar occurs with the words "who" and "is". On the other hand, the words "fuck", "you" and "bitch", in addition to having a stronger relationship with other offensive words such as "loser", are also related to the target of the offense: "<user>".
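The reading of the heatmaps described above, where each row of an attention filter is inspected for its most relevant words, can be sketched as a small helper that lists the top-weighted tokens per row. The function name `top_attended`, the neutral tokens and the attention values below are all made up for illustration; they are not the values shown in Figure 4.

```python
import numpy as np

def top_attended(tokens, alpha, k=3):
    """For each query token (row of alpha), list the k tokens with the largest weights."""
    result = {}
    for i, token in enumerate(tokens):
        order = np.argsort(-alpha[i], kind="stable")[:k]   # indices of the k largest weights
        result[(i, token)] = [tokens[j] for j in order]    # keyed by position to allow repeats
    return result

# Hypothetical row-normalized attention filter over a four-token sequence.
tokens = ["<user>", "who", "is", "there"]
alpha = np.array([
    [0.70, 0.15, 0.10, 0.05],
    [0.10, 0.60, 0.25, 0.05],
    [0.05, 0.30, 0.50, 0.15],
    [0.05, 0.10, 0.25, 0.60],
])
ranking = top_attended(tokens, alpha, k=2)
for (position, token), neighbours in ranking.items():
    print(position, token, neighbours)
```

The same helper could be applied to α_s and α_g alike, since both are n×n filters whose rows are indexed by the query word.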

Conclusions and future work
One of the main problems in the use of current AMs is the loss of contextual or internal information between the elements of a sequence. To tackle this issue we proposed the SCA mechanism, which integrates the SA and CA mechanisms to construct a representation that considers both the internal and contextual relationships between the elements of a sequence. Due to the highly context-dependent interpretation of words in AL identification, in this work we explored the use of the proposed SCA mechanism in this task. The results obtained on four collections, covering different kinds of AL, were encouraging; they improved on state-of-the-art approaches in 2 out of 4 datasets. In addition, the SA and CA mechanisms were evaluated against the SCA mechanism; the results show a quantitative and qualitative improvement with the SCA mechanism, which allows us to conclude that the SCA mechanism is useful for discriminating between offensive and non-offensive contexts.
Since the most recent approaches are based on Transformers, as future work we plan to explore the use of our proposed SCA mechanism in the design of a multi-head SCA architecture. Additionally, we plan to explore new ways of combining the SA and CA mechanisms, as well as novel approaches to building the SCA mechanism without the need to compute the SA and CA mechanisms individually. Finally, we plan to apply the proposed SCA mechanism to other related tasks where the interpretation of words is highly context dependent, such as the detection of deception or the detection of depressed social media users.