DeepAnalyzer at SemEval-2019 Task 6: A deep learning-based ensemble method for identifying offensive tweets

This paper describes the system we developed for SemEval 2019 on Identifying and Categorizing Offensive Language in Social Media (OffensEval - Task 6). The task focuses on offensive language in tweets. It is organized into three sub-tasks: offensive language identification, automatic categorization of offense types, and offense target identification. The approach for the first sub-task is a deep learning-based ensemble method which uses a Bidirectional LSTM Recurrent Neural Network and a Convolutional Neural Network. Additionally, we use information from part-of-speech tagging of the tweets for target identification, and we combine the previous results for the categorization of offense types.


Introduction
The Internet has become an important medium for personal and commercial communication. In this scenario, some users take advantage of the anonymity of this kind of communication to engage in behaviour that many of them would not consider in real life. As a result, offensive language is widespread in social networks. Studying offensive language in social media texts is therefore an essential task for security and for the prevention of cyber-bullying, among other abusive behaviors.
To encourage research in this area, several workshops have been organized, such as ALW (https://sites.google.com/site/abusivelanguageworkshop2017/) and TRAC (https://sites.google.com/view/trac1/home). Recently, OffensEval (https://competitions.codalab.org/competitions/20011) (Zampieri et al., 2019b), a shared task at the SemEval-2019 workshop (http://alt.qcri.org/semeval2019/index.php?id=tasks), was launched for the research community. The aim of OffensEval is to address offensive language detection in the English language, focusing on messages from Twitter.
In OffensEval, the treatment of offensive content is divided into three sub-tasks that take the type and target of offenses into account:
• A: Offensive language identification.
• B: Automatic categorization of offense types.
• C: Offense target identification.
In this work, we present the methodology proposed for each of these sub-tasks, which includes an ensemble of an LSTM Recurrent Neural Network and a Convolutional Neural Network, together with additional linguistic features for the last two sub-tasks. The architecture of the system is described in more detail in the following sections.
The paper is organized as follows. The next section briefly describes other work in this area. Section 3 then describes the proposed methodology and the dataset. Results are discussed in Section 4. Finally, we draw our conclusions together with a summary of our findings in Section 5.

Related Work
Several approaches have been proposed to tackle the problem of offensive language detection. These include recent works (Waseem et al., 2017; ElSherief et al., 2018; Gambäck and Sikdar, 2017; Zhang et al., 2018) and surveys (Schmidt and Wiegand, 2017; Fortuna and Nunes, 2018). There are also studies on languages other than English, such as (Su et al., 2017) on Chinese and (Fišer et al., 2017) on Slovene.
Many of the most recent approaches rely on neural network models. For instance, (Ganesan et al., 2018) present a multi-layer feedforward neural network. (Park and Fung, 2017) implement three models based on Convolutional Neural Networks (CNN) to classify sexist and racist abusive language: CharCNN, WordCNN, and HybridCNN. Their work reports that the approach can boost the performance of simpler models. In addition, (Pitsilis et al., 2018) propose a detection scheme that is an ensemble of Recurrent Neural Network classifiers and incorporates various features associated with user-related information. They report that the scheme can successfully distinguish racist and sexist messages from normal text.

Methodology and Data
The corpus provided by the organizers consists of 14,100 tweets in English. The data collection methods used to compile the OffensEval dataset are described in Zampieri et al. (2019a).
The first step is the preprocessing of the tweets, in which the texts are cleaned: all emoticons, hashtags and URLs are removed. The texts are then represented as sequences of word embedding vectors. We used the pre-trained GloVe word vectors (Pennington et al., 2014), trained on 2 billion tweets from Twitter.
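As a minimal sketch of the cleaning step described above, the following Python function removes URLs, hashtags and emoticons and returns the remaining tokens. The regular expressions, the lowercasing and the whitespace tokenization are assumptions made for illustration; the exact rules used in the system are not specified.

```python
import re
from typing import List

# Hypothetical patterns: the paper does not give its exact cleaning rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
# Whitespace-delimited ASCII emoticons such as :) ;-( :D
EMOTICON_RE = re.compile(r"(?<!\S)[:;=8][\-o\*']?[\)\]\(\[dDpP/\\|](?!\S)")
# A rough range of emoji and pictograph code points.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def clean_tweet(text: str) -> List[str]:
    """Strip URLs, hashtags and emoticons, then lowercase and tokenize."""
    for pattern in (URL_RE, HASHTAG_RE, EMOTICON_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    # Lowercasing and whitespace splitting are assumptions of this sketch.
    return text.lower().split()
```

Each remaining token would then be looked up in the pre-trained embedding table to build the input matrix of the models described next.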
The method proposed in this work is based on an architecture that sequentially obtains the output for each of the sub-tasks. At the first level we use a model whose input is the word embeddings of a tweet and whose output is a vector (r vector) that serves as a compact representation of the input and is used in the following steps. Two types of networks have been used for this model: a Recurrent Neural Network (RNN) in a first approach, and a Convolutional Neural Network (CNN) in a second one. These two models are described below.

Convolutional Neural Network
The model is a version of the convolutional neural network presented in (Kim, 2014) for sentence-level classification tasks. Here, the input of the model is a matrix in which each row corresponds to the embedding vector of one word in the tweet. Filters of three different sizes (3, 4 and 5) are applied in a 1D convolution step to capture information from 3-grams, 4-grams and 5-grams. The feature maps produced by the convolution layer are forwarded to a max-pooling layer. We used 2x2 filters for this pooling function, which reduces each feature map to its most dominant features.
Finally, the r vector is generated by concatenating the results for each of the filters.
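The convolution, pooling and concatenation steps described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the system's implementation: the names `conv_max` and `r_vector` are hypothetical, the filter weights are supplied by the caller, and the pooling shown is the max-over-time pooling of (Kim, 2014), used here in place of the 2x2 pooling mentioned in the text to keep the sketch short.

```python
from typing import Dict, List

Matrix = List[List[float]]  # one row per word, one column per embedding dimension


def conv_max(embeddings: Matrix, kernel: Matrix, bias: float = 0.0) -> float:
    """Slide one filter over the tweet, apply ReLU, and keep the maximum response.

    The filter width is len(kernel), i.e. the n-gram size it covers.
    """
    width = len(kernel)
    best = 0.0  # a ReLU output is never below zero
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        score = bias + sum(
            w * x
            for w_row, x_row in zip(kernel, window)
            for w, x in zip(w_row, x_row)
        )
        best = max(best, score)
    return best


def r_vector(embeddings: Matrix, filters_by_width: Dict[int, List[Matrix]]) -> List[float]:
    """Concatenate the pooled outputs of all filters (widths 3, 4 and 5 in the paper)."""
    pooled = []
    for width in sorted(filters_by_width):
        for kernel in filters_by_width[width]:
            pooled.append(conv_max(embeddings, kernel))
    return pooled
```

In this sketch a tweet shorter than the filter width simply yields 0.0 for that filter; a real implementation would normally pad the input instead.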

Recurrent Neural Network
In NLP problems, a standard LSTM Recurrent Neural Network reads the text sequentially (in left-to-right order): at each time step it receives a word embedding $w_t$ and produces a hidden state $h_t$. A bidirectional LSTM performs the same operations as a standard LSTM, but processes the incoming text in left-to-right and right-to-left order in parallel. Its output is therefore two hidden states at each time step, $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$. The proposed method uses a Bidirectional LSTM network which takes each new hidden state to be the concatenation of these two, $\hat{h}_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$. The idea of this Bi-LSTM is to capture long-range and backward dependencies.
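The bidirectional processing described above can be illustrated with a short sketch that runs a recurrent step function in both directions and concatenates the two hidden states at each time step. The function names are hypothetical, and `toy_step` is a deliberately trivial stand-in for an LSTM cell (it only accumulates the first input feature), so the example shows the wiring of the two directions rather than the LSTM gating itself.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Step = Callable[[Vector, Vector], Vector]  # (input at time t, previous hidden) -> new hidden


def run_direction(inputs: Sequence[Vector], step: Step, hidden_size: int) -> List[Vector]:
    """Apply a recurrent step over the inputs in the given order."""
    hidden = [0.0] * hidden_size
    states = []
    for x in inputs:
        hidden = step(x, hidden)
        states.append(hidden)
    return states


def bidirectional(inputs: Sequence[Vector], fwd: Step, bwd: Step, hidden_size: int) -> List[Vector]:
    """Concatenate the forward and backward hidden states at every time step."""
    forward = run_direction(inputs, fwd, hidden_size)
    # Run over the reversed text, then flip back so index t matches the forward pass.
    backward = run_direction(list(reversed(inputs)), bwd, hidden_size)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def toy_step(x: Vector, hidden: Vector) -> Vector:
    """Stand-in for an LSTM cell: a running sum of the first input feature."""
    return [hidden[0] + x[0]]
```

With the toy cell, the state at each position combines what was seen to its left (forward pass) with what follows to its right (backward pass), which is the property the Bi-LSTM exploits.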

Sub-task A
For the first sub-task, which consists in the identification of offensive language in tweets, the r vector is used as input to a Fully Connected Neural Network (FCNN) of two layers with the ReLU activation function. The class (offensive or not) is obtained in a third layer with a softmax activation function and two units, one per class. Figure 1 shows the general scheme described above. Given this architecture, three weightings of the CNN and RNN models were used. In the first weighting all the weight is given to the RNN (RNN run). In the second one, in contrast, all the weight is given to the CNN (CNN run). The third one is the actual ensemble model, in which both models are assigned equal weight (Ensemble run). To combine the results of both models, the system takes the mean of their predictions.
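The three runs can be seen as one weighted average of the class probabilities produced by the two models. The following sketch makes this explicit; the function names, the label order and the example probabilities are assumptions made for illustration.

```python
from typing import List

LABELS = ["NOT", "OFF"]  # label order is an assumption of this sketch


def ensemble(p_rnn: List[float], p_cnn: List[float], w_rnn: float = 0.5) -> List[float]:
    """Weighted average of two softmax outputs; w_rnn is the weight of the RNN."""
    w_cnn = 1.0 - w_rnn
    return [w_rnn * a + w_cnn * b for a, b in zip(p_rnn, p_cnn)]


def predict(p_rnn: List[float], p_cnn: List[float], w_rnn: float = 0.5) -> str:
    """Return the label with the highest combined probability."""
    probs = ensemble(p_rnn, p_cnn, w_rnn)
    return LABELS[max(range(len(probs)), key=probs.__getitem__)]


# Hypothetical softmax outputs for a single tweet.
rnn_out, cnn_out = [0.8, 0.2], [0.3, 0.7]
rnn_run = predict(rnn_out, cnn_out, w_rnn=1.0)       # all weight on the RNN
cnn_run = predict(rnn_out, cnn_out, w_rnn=0.0)       # all weight on the CNN
ensemble_run = predict(rnn_out, cnn_out, w_rnn=0.5)  # mean of both predictions
```

Setting `w_rnn` to 1.0 or 0.0 reproduces the single-model runs, and 0.5 gives the mean used in the Ensemble run.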

Sub-tasks B and C
In the sub-task of detecting the target of offensive language, information from the part-of-speech (POS) tagging of the tweets is used. This allows us to make more fine-grained distinctions among the words of a text that can identify the target of the aggression. For instance, this information makes it possible to discriminate between a proper noun and other kinds of noun, and between plural and singular nouns. In this way the model can learn sequences of tags that represent each type of target. The POS labels are obtained with Stanford CoreNLP and are represented as one-hot vectors. The sequence of labels is analyzed with an LSTM RNN, and a representation, the p vector, is obtained. Then, the concatenation of the r vector and the p vector is used as input to another FCNN with one hidden layer with the ReLU activation function, and an output layer with two neurons with a softmax activation function. In this way, the prediction corresponding to the offensive target in the tweets is obtained. Figure 2 shows this processing.
For the sub-task of classifying the types of offensive tweet, the prediction is obtained in a similar way to the previous sub-task. Here, a one-hot vector corresponding to the POS tags present in the tweet is added to the r vector. Then, the prediction is calculated using another FCNN.
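The two POS representations mentioned above, a sequence of one-hot vectors and a single vector marking the tags present in the tweet, can be sketched as follows. The tag list is a small illustrative subset of the Penn Treebank tags produced by Stanford CoreNLP, with a hypothetical `OTHER` bucket for everything else; the function names are likewise assumptions.

```python
from typing import List

# Illustrative subset of Penn Treebank tags; "OTHER" is a hypothetical catch-all.
TAGSET = ["NN", "NNS", "NNP", "NNPS", "PRP", "VB", "OTHER"]


def tag_index(tag: str) -> int:
    """Map a tag to its position in TAGSET, falling back to OTHER."""
    return TAGSET.index(tag) if tag in TAGSET else TAGSET.index("OTHER")


def one_hot(tag: str) -> List[int]:
    """One-hot vector for a single POS tag."""
    vec = [0] * len(TAGSET)
    vec[tag_index(tag)] = 1
    return vec


def encode_sequence(tags: List[str]) -> List[List[int]]:
    """Sequence of one-hot vectors, the input of the LSTM that yields the p vector."""
    return [one_hot(tag) for tag in tags]


def tags_present(tags: List[str]) -> List[int]:
    """Single vector marking which tags occur in the tweet."""
    vec = [0] * len(TAGSET)
    for tag in tags:
        vec[tag_index(tag)] = 1
    return vec
```

The distinction between `NN`, `NNS`, `NNP` and `NNPS` is what lets the model separate singular from plural and common from proper nouns, as discussed in the text.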
Finally, cross entropy is used as the loss function, which is defined as:
$$L = -\sum_i y_i \log \hat{y}_i$$
where $y_i$ is the ground-truth classification of the tweet and $\hat{y}_i$ the predicted one.
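For a single tweet, the cross-entropy loss can be computed as in the following sketch, which assumes a one-hot ground-truth vector and a softmax output. The function name and the clipping constant `eps` are assumptions added to avoid taking the logarithm of zero.

```python
import math
from typing import List


def cross_entropy(y_true: List[float], y_pred: List[float], eps: float = 1e-12) -> float:
    """Cross entropy between a one-hot target and predicted class probabilities."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))


# A tweet whose true class is the second one, predicted with probability 0.75.
loss = cross_entropy([0.0, 1.0], [0.25, 0.75])
```

Only the probability assigned to the true class contributes to the sum, so the loss decreases as the model becomes more confident in the correct label.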

Results
In the evaluation, the official ranking metric is macro-averaged F1. The results obtained in each sub-task are shown in the following tables and confusion matrices. In each case, the three approaches discussed above (CNN run, RNN run and Ensemble run) were evaluated, and their results appear in the tables under those names. Baselines generated by assigning the same label to all instances are also included for comparison. For example, "All OFF" in sub-task A represents the performance of a system that labels everything as offensive. These results are not as good as expected: although the baselines were exceeded in every case, the results were relatively far from the best results of the competition. This may be because the different linguistic characteristics that could be extracted from tweets, such as information related to emoticons, hashtags and URLs, were not analyzed in detail.
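To make the metric and the constant baselines concrete, the following sketch computes macro-averaged F1 and evaluates an "All OFF" system on a tiny hypothetical gold standard. The data are invented for illustration and are not results from the task.

```python
from typing import List


def f1_for(label: str, gold: List[str], pred: List[str]) -> float:
    """F1 score of one class, treating it as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    return sum(f1_for(label, gold, pred) for label in labels) / len(labels)


# Hypothetical gold labels and an "All OFF" baseline.
gold = ["OFF", "NOT", "NOT", "NOT"]
all_off = ["OFF"] * len(gold)
baseline_score = macro_f1(gold, all_off)
```

Because every class counts equally, a system that never predicts one of the classes receives an F1 of zero for it, which is why constant baselines score low under this metric.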

Another aspect to note is that, for all three tasks, the best approach is a single model rather than a combination of models, which in our case was an ensemble of neural network-based models. For two of the tasks the best results were obtained with the CNN alone, and for the other one with the RNN.

Conclusion
In this paper we presented our solution for the OffensEval challenge in SemEval 2019. We used an ensemble of models based on deep learning, and compared its results with those obtained with each of the models independently. We conclude that, for this kind of task, it may be more important to search for appropriate linguistic characteristics than to design complex models with many parameters.