HAD-Tübingen at SemEval-2019 Task 6: Deep Learning Analysis of Offensive Language on Twitter: Identification and Categorization

This paper describes the submissions of our team, HAD-Tübingen, to SemEval 2019 Task 6: “OffensEval: Identifying and Categorizing Offensive Language in Social Media”. We participated in all three sub-tasks: sub-task A, “Offensive language identification”; sub-task B, “Automatic categorization of offense types”; and sub-task C, “Offense target identification”. As a baseline model we used a long short-term memory (LSTM) recurrent neural network to identify and categorize offensive tweets. For all sub-tasks we experimented with external databases in a postprocessing step to improve the predictions of our model. The best macro-averaged F1 scores obtained for sub-tasks A, B and C are 0.73, 0.52, and 0.37, respectively.


Introduction
The use of offensive language is a ubiquitous problem on social networking services like Twitter. Users of such services often exploit the anonymity of computer-mediated communication to engage in offensive behaviour against individuals, groups and/or organizations. Due to increasing problems with offensive language and a rising demand for its detection on platforms like Twitter, tasks similar to the current one have already become popular for several languages: English (Waseem et al., 2017), German (Wiegand et al., 2018) and Spanish (Rosso et al., 2018). With the increasing popularity of Twitter, which had over 1.48 billion users as of June 2013 and gains new accounts every day, tackling the well-known problem of insults on the platform has become more and more necessary.
The Twitter platform describes itself as a connection to "what's happening in the world and what people are talking about right now". For this reason alone, its data attracts more and more NLP researchers all over the world. "Tweets", the messages one can send over this platform, can be described as micro-texts limited to 280 characters, through which users can interact with each other or simply post statements. Since the input is up to the user, it may include misspellings, emoticons and hashtags, but also slang and abusive words, which makes these messages a valuable source for different analyses.
As mentioned at the beginning, this paper presents our approach to SemEval 2019 Task 6: "OffensEval: Identifying and Categorizing Offensive Language in Social Media". For task information see Zampieri et al. (2019b), and for the dataset description see Zampieri et al. (2019a). We took part in all three sub-tasks, using an LSTM-based classifier. In the remainder of the paper, we describe our methods and discuss both our results and suggestions for further work.

System description
Neural network models have recently gained popularity for text classification tasks, since they model sequences well and offer computational advantages. For this competition, we used a unidirectional LSTM whose recurrent component took a sequence of words as input. We set the basic parameters of the model as follows. The number of epochs was 30. The batch size for sub-task A was 43, since it was the smallest batch size that evenly divides the 860 test tweets and with which our model still performed well. For the other sub-tasks we used batch sizes of 30 and 71 for the test sets of 240 and 213 tweets, respectively. We used 4 hidden layers with 50 neurons each, since our overall score declined when we either decreased or increased their number. The dropout ratio was set to 0.95, the embedding size to 100, and the learning rate varied between submissions from 0.003 to 0.005.
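The batch sizes above were chosen among the values that evenly divide each test set. As a minimal sketch of that constraint, the following snippet lists the candidate batch sizes for the three test-set sizes reported in the text; the function name `divisors` is introduced here for illustration only and is not taken from our repository.

```python
def divisors(n):
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


# Test-set sizes reported in the text for sub-tasks A, B and C.
test_set_sizes = {"A": 860, "B": 240, "C": 213}

for task, size in test_set_sizes.items():
    # Every printed value is a batch size that splits the test set
    # into equally sized batches.
    print(task, divisors(size))
```

The chosen values 43, 30 and 71 each appear in the corresponding list; which divisor works best remains an empirical question.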
The model was implemented in Python and makes use of the Tensorflow (Abadi et al., 2015) and Scikit-learn (Pedregosa et al., 2011) libraries for training the classifier. We tuned our architecture parameters using the predictions of a support vector machine (SVM) model described in Rama and Çöltekin (2017) and Çöltekin and Rama (2018). It used a 'bag of n-grams' as features and, unlike our LSTM-based model, which took only word n-grams, combined character and word n-grams weighted by sublinear TF-IDF scaling. We picked the epoch with the best F1-score for each parameter setting according to these SVM predictions. Our repository can be found on GitHub: https://github.com/cicl2018/semeval-2019task-6-HAD.
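To illustrate what combined character and word n-grams with sublinear TF-IDF scaling look like, here is a small pure-Python sketch. It is not the SVM system cited above: the n-gram ranges, the smoothed IDF formula, the missing length normalization and all function names are assumptions made for this example.

```python
import math
from collections import Counter


def word_ngrams(text, n_max=2):
    """Word n-grams of order 1..n_max (n_max is an assumed setting)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n_max=3):
    """Character n-grams of order 1..n_max (n_max is an assumed setting)."""
    return [text[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(text) - n + 1)]


def combined_features(text):
    """Prefix features so that word and character n-grams never collide."""
    return (["w:" + g for g in word_ngrams(text)]
            + ["c:" + g for g in char_ngrams(text)])


def sublinear_tfidf(docs):
    """Weight each feature by (1 + ln tf) * idf, without normalization.

    The smoothed idf, ln((1 + N) / (1 + df)) + 1, is an assumption.
    """
    counts = [Counter(combined_features(d)) for d in docs]
    n_docs = len(docs)
    df = Counter(f for c in counts for f in c)
    idf = {f: math.log((1 + n_docs) / (1 + df[f])) + 1.0 for f in df}
    return [{f: (1.0 + math.log(tf)) * idf[f] for f, tf in c.items()}
            for c in counts]


vectors = sublinear_tfidf(["you are bad", "you are nice"])
```

Sublinear scaling replaces the raw term frequency tf with 1 + ln(tf), so a feature that occurs twice in a tweet counts less than twice as much as one that occurs once.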

Preprocessing
For neural network classification, data preprocessing has a great impact on the system's performance. Thus, at least one step from the following procedure was applied for all the submissions:
• lowercasing, since uppercased words can be both offensive and not
• hashtag parsing (e.g. #retrogaming → # retro gaming) (see Baziotis et al. 2017). This tool is trained on 2 big corpora:
  - English Wikipedia
  - a collection of 330 million English Twitter messages
• removing tokens containing "@USER". The user names are not given, thus this information is irrelevant for the classification task.
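The three preprocessing steps can be sketched as follows. This is a simplified illustration, not our actual pipeline: the hashtag segmenter cited above relies on corpus statistics, whereas this sketch uses a greedy longest-match over a tiny hypothetical vocabulary (`VOCAB`), and the function names are invented for the example.

```python
# Hypothetical toy vocabulary; the cited tool uses statistics from
# large corpora instead of a fixed word list.
VOCAB = {"retro", "gaming", "make", "america", "great", "again"}


def split_hashtag(tag, vocab=VOCAB):
    """Greedy longest-match segmentation of a hashtag body."""
    body = tag.lstrip("#").lower()
    words, i = [], 0
    while i < len(body):
        for j in range(len(body), i, -1):
            if body[i:j] in vocab:
                words.append(body[i:j])
                i = j
                break
        else:
            # No vocabulary word starts at position i: keep the body unsplit.
            return "# " + body
    return "# " + " ".join(words)


def preprocess(tweet):
    """Lowercase, drop tokens containing @USER, and parse hashtags."""
    kept = []
    for token in tweet.lower().split():
        if "@user" in token:
            continue  # anonymized user mentions carry no information
        if token.startswith("#") and len(token) > 1:
            kept.append(split_hashtag(token))
        else:
            kept.append(token)
    return " ".join(kept)


print(preprocess("@USER I love #RetroGaming"))
```

A greedy segmenter can pick a wrong split when a long vocabulary word overlaps the correct boundary, which is one reason to prefer a statistical tool in practice.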

Sub-task A - Offensive language identification
Sub-task A was a binary classification task. The goal was to identify whether the post is offensive (OFF) or not (NOT). The provided tweets were labeled as OFF if they contained any form of nonacceptable language or a targeted offense, and labeled as NOT in any other case.

System pipeline for sub-task A
Figure 1 describes the system architecture for sub-task A. For each of the three submissions we tried a different approach.
1. All the preprocessing steps (Section 2.1) + LSTM classifier with the use of SVM predictions (see Section 2.2.2 and green arrows in Figure 1).
2. All the preprocessing steps + LSTM classifier with SVM predictions + additional manually created offensive word list (see Section 2.2.3 and black arrows in Figure 1).
3. Hashtag parsing as a single preprocessing step + LSTM classifier with SVM predictions (see Section 2.2.4 and red arrows in Figure 1).

Submission 1, Sub-task A
In our first submission, we fed the preprocessed data into our LSTM model, setting the configuration (e.g. a learning rate of 0.003) according to the outcome of the SVM predictions (Figure 1: green arrows).

Submission 2, Sub-task A
For the second submission we used an additional, manually created offensive word list. After all the preprocessing steps, we ran the model with the same configuration as in the first submission except for a learning rate of 0.005, picking the epoch with the best F1-score with respect to the SVM predictions. We then postprocessed the results using this manually collected offensive vocabulary: tweets that contained abusive words from the list but were labeled as not offensive by our model were reannotated as offensive (Figure 1: black arrows).
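The postprocessing rule described above can be sketched in a few lines. The tokenization, the punctuation stripping and the function name `relabel_with_lexicon` are assumptions for this illustration, and the one-word lexicon in the usage example is a placeholder rather than our actual list.

```python
def relabel_with_lexicon(tweets, labels, lexicon):
    """Flip NOT to OFF when a tweet contains a lexicon word.

    Labels predicted as OFF are never changed back to NOT.
    """
    lexicon = {word.lower() for word in lexicon}
    relabeled = []
    for tweet, label in zip(tweets, labels):
        # Assumed tokenization: whitespace split plus punctuation stripping.
        tokens = {tok.strip(".,!?;:\"'()") for tok in tweet.lower().split()}
        if label == "NOT" and tokens & lexicon:
            relabeled.append("OFF")
        else:
            relabeled.append(label)
    return relabeled


# Placeholder data for illustration only.
tweets = ["You are an idiot!", "Nice day", "idiot"]
predicted = ["NOT", "NOT", "OFF"]
print(relabel_with_lexicon(tweets, predicted, {"idiot"}))
```

Because the rule can only add OFF labels, it raises recall for the offensive class at the risk of lowering precision whenever a listed word is used in a harmless context.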

Submission 3, Sub-task A
As a third submission, we preprocessed the raw tweets only by hashtag parsing and let an LSTM model with a learning rate of 0.005 classify the data, choosing the epoch with the best F1-score according to the SVM predictions (Figure 1: red arrows).

Sub-task B - Automatic categorization of offense types
Sub-task B was a classification task of targeted (TIN) vs. untargeted (UNT) tweets. The test set contained only offensive (OFF) posts from the first sub-task. Tweets were considered targeted if they were insults or threats to an individual or group, and untargeted in any other case. For this sub-task we reduced the initial training set of 13,240 tweets to 4300 by removing the tweets labeled NOT, since non-offensive tweets would not improve the model and might even distort the learning process.

System pipeline for sub-task B
The system architecture for this sub-task is illustrated in Figure 2. Since the number of tweets in the training data differed greatly between the categories (524 and 3876 for UNT and TIN, respectively), we used a weighted cross entropy to balance the data. As in sub-task A, our approaches varied between submissions, but this time we handed in 2 submissions. The architecture of the first submission for this sub-task is very similar to that of the first submission for sub-task A, the only difference being that the learning rate was changed to 0.005 (Figure 2: green arrows).
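To make the idea of a weighted cross entropy concrete, the sketch below derives class weights from the UNT and TIN counts given above and applies them to a per-example loss. The inverse-frequency weighting scheme and the function name are assumptions for illustration; the text does not specify the exact weights we used.

```python
import math

# Class counts reported in the text.
counts = {"UNT": 524, "TIN": 3876}
total = sum(counts.values())

# Assumed scheme: inverse-frequency weights, total / (n_classes * count),
# so the rare class (UNT) contributes more to the loss per example.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}


def weighted_cross_entropy(probs, gold, class_weights):
    """Mean of -w[y] * ln p(y) over all examples.

    probs: one dict per example, mapping each label to its probability.
    gold:  the true label of each example.
    """
    losses = [-class_weights[y] * math.log(p[y]) for p, y in zip(probs, gold)]
    return sum(losses) / len(losses)


uncertain = [{"UNT": 0.5, "TIN": 0.5}]
print(weights)
print(weighted_cross_entropy(uncertain, ["UNT"], weights))
print(weighted_cross_entropy(uncertain, ["TIN"], weights))
```

With these weights, an equally uncertain prediction is penalized more when the true label is UNT than when it is TIN, which counteracts the tendency to predict the majority class.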

Submission 2, Sub-task B
For the second submission we added a postprocessing step, in which we reannotated as targeted the tweets that contained particular word forms from a manually created list of potential insult victims (Figure 2: black arrows). This database included the following four parts:
• Names of representatives of top Twitter profiles from the USA, the UK, Saudi Arabia, Brazil, India and Spain, since these countries have the most Twitter users, and from Iran, Iraq, Turkey, Russia and Germany, because we predicted possible aggression towards users from these countries. The data was obtained from https://www.socialbakers.com/statistics/twitter/profiles/.
• A list of ethnic slurs, mostly extracted from https://en.wikipedia.org/wiki/List_of_ethnic_slurs
• A list of name-callings, primarily collected from https://www.urbandictionary.com/
• A list of 2nd and 3rd person pronouns and contractions with them (e.g. you, they've etc.)

Sub-task C - Offense target identification
The third sub-task addressed offense target identification. This time we had three categories to choose from: individual (IND), group (GRP), or other (OTH). A tweet was labeled as IND if the potential victim was a famous person, a named individual or an unnamed person interacting in the conversation. It was labeled as GRP if the tweet was offensive towards a group of people considered as a unit due to the same ethnicity, gender or sexual orientation, political affiliation, religious belief, or similar, and as OTH if the tweet was intended to abuse an organization, a situation, an event, or an issue. The test data contained 213 offensive targeted tweets from sub-task B. The training set of 4300 offensive tweets was reduced to 3909 targeted ones for this sub-task.

System pipeline for sub-task C
The system architecture for this sub-task is illustrated in Figure 3.

Submission 1, Sub-task C
This submission resembles the first submissions for the two previous sub-tasks, but the batch size was set to 71 and the learning rate to 0.003 (Figure 3: red arrows).

Submission 2, Sub-task C
In the second submission, we postprocessed the classified data using the following datasets:
• Names of representatives of top Twitter profiles from the USA, the UK, Saudi Arabia, Brazil, India, Spain, Iran, Iraq, Turkey, Russia and Germany. The data was obtained from https://www.socialbakers.com/statistics/twitter/profiles/:
  - celebrities and society/politics industries for identifying individual targets
  - community/political and community/religion industries for recognizing group targets
  - places, brands and entertainment/event industry for other targets
• A list of ethnic slurs (see Section 2.3.3) for identifying group targets
• A dataset of name-callings (see Section 2.3.3) for recognizing individual victims
These datasets helped to assign the categories IND, GRP or OTH by looking up the words of a tweet in our lists (Figure 3: green arrow).
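The lookup for target categories can be sketched as below. The priority order in which the lists are consulted, the token-level matching (which would miss multi-word names) and the function name are assumptions for this illustration; the lexicon entries are placeholders, not items from our datasets.

```python
def relabel_target(tweet, predicted, lexicons, priority=("IND", "GRP", "OTH")):
    """Return the first category whose lexicon matches a token of the tweet.

    lexicons: dict mapping a category to a set of lowercase words.
    priority: assumed order in which the categories are checked.
    Falls back to the classifier's prediction when nothing matches.
    """
    tokens = {tok.strip(".,!?;:\"'()") for tok in tweet.lower().split()}
    for label in priority:
        if tokens & lexicons.get(label, set()):
            return label
    return predicted


# Placeholder lexicons for illustration only.
lexicons = {
    "IND": {"moron"},
    "GRP": {"liberals"},
    "OTH": {"nfl"},
}

print(relabel_target("The NFL is a joke", "IND", lexicons))
print(relabel_target("nothing to see here", "GRP", lexicons))
```

When a tweet matches several lists, the priority order decides the outcome, so that order is a design choice that should be validated on held-out data.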

Submission 3, Sub-task C
The third submission differed from the previous one only in adding a list of 2nd and 3rd person pronouns, including their contractions, to the existing database for the postprocessing step. We decided to try an approach with personal pronouns despite the fact that they can target both individuals (e.g. "Take it out, you fucking wanker, or I'll take you out".) and groups (e.g. "All you democrats suck, and your momma's fat!").

Results
The results presented below, provided by the organisers of OffensEval 2019, were evaluated using the macro-averaged F1-score. They included accuracy as well for comparison. Baseline results generated by assigning the same label to all instances were also added to Table 1. For example, "All OFF" in sub-task A represents the performance of a system that labels everything as offensive. The best results for the first sub-task were produced by the simplest approach, which included only hashtag parsing as a preprocessing step and an LSTM-based classifier with configurations set according to the SVM predictions. A plausible explanation for the poor performance of the second submission with a lexical lookup is that a task-specific lexicon is better used as an input feature, which can only influence the classification, rather than as a decisive postprocessing step.
The scores of our two submissions for sub-task B are given in Table 2. As before, the organizers also included baseline results generated by assigning the same label to all instances. The best results for this sub-task were again achieved by an LSTM model with preprocessing only. Most likely, the problem was that our external dataset largely aimed to recognize the names of top Twitter accounts, which most frequently occur as usernames in tweets. However, in our case usernames were anonymized (@USER) in both the training and test sets.
The last table shows the scores of our submissions for sub-task C. For this sub-task, which was devoted to categorizing targets of offense, a considerable increase in F1-score can be observed when using the external datasets for postprocessing. Hence, the results show that a lexical lookup can be much more effective in categorizing the possible victims than in identifying the presence of aggression per se. Below one can also find the confusion matrices of our best runs.
It is also worth mentioning that the choice of a model and its settings should take the training set size into account. In our case, the volume differed significantly between the sub-tasks, and a significantly lower performance of all the submissions can be observed on the last sub-task, which had the smallest training set.
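For readers unfamiliar with the metric, the following sketch computes the macro-averaged F1-score and applies it to an "All OFF" style baseline. The gold labels are a toy distribution invented for this example, not the task data, and the function name is our own.

```python
def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1-scores."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denominator = 2 * tp + fp + fn
        # A class that is never predicted and never correct gets F1 = 0.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy gold labels (3 OFF, 7 NOT); not the actual test distribution.
gold = ["OFF"] * 3 + ["NOT"] * 7
all_off = ["OFF"] * len(gold)

accuracy = sum(g == p for g, p in zip(gold, all_off)) / len(gold)
print(accuracy, macro_f1(gold, all_off))
```

Since every class counts equally in the macro average, a system that never predicts one of the classes receives an F1 of zero for that class, which is why single-label baselines score low on this metric even when their accuracy looks acceptable.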

Conclusion and future work
In this paper we presented the contribution of HAD-Tübingen to OffensEval 2019 (SemEval 2019 Task 6). Our approach combines sentence simplification as a preprocessing step and a lexical lookup as a postprocessing step with a unidirectional LSTM with 4 hidden layers. For each model configuration, we picked the epoch with the best F1-score according to the SVM predictions. We found that simple LSTM models are not likely to outperform SVMs in such classification tasks. A possible next step in working with an LSTM-based classifier would be to use an external task-specific lexicon as an input feature to our model rather than in a postprocessing step. We would also like to make use of the pretrained vectors from the fastText library, which are based on sub-word character n-grams, to improve our model.