NLPR@SRPOL at SemEval-2019 Task 6 and Task 5: Linguistically enhanced deep learning offensive sentence classifier

The paper presents a system developed for the SemEval-2019 competition Task 5 hatEval (Basile et al., 2019) (team name: LU Team) and Task 6 OffensEval (Zampieri et al., 2019b) (team name: NLPR@SRPOL), where we achieved 2nd position in Subtask C. The system combines several models in an ensemble (LSTM, Transformer, OpenAI’s GPT, Random forest, SVM) with various embeddings (custom, ELMo, fastText, Universal Sentence Encoder) together with additional linguistic features (number of blacklisted words, special characters, etc.). The system works with a multi-tier blacklist and a large corpus of crawled data, annotated for general offensiveness. In the paper we analyze our results extensively and show how the combination of features and embeddings affects the performance of the models.


Introduction
As of 2017, two-thirds of all adults in the United States had experienced some form of online harassment (Duggan, 2017).¹ This, together with various episodes of online harassment, boosted research on the general problem of recognizing and/or filtering offensive language on the Internet. Still, tasks such as recognizing whether a sentence expresses hate speech against immigrants or women, or understanding whether a sentence is offensive to a group of people, an individual or others, continue to be very difficult for neural networks and machine learning models to accomplish. Various implementations have been proposed to address them; for the most successful recent approaches see Pitsilis et al. (2018); Founta et al. (2018); Wulczyn et al.
¹ Due to the topic of the SemEval-2019 Tasks 5 and 6, the present paper contains offensive expressions spelled out in full. These are solely illustrations of the problems under consideration. They should not be interpreted as expressing our views in any way.
This article presents a system that we have implemented for recognizing if a sentence is offensive. The system was developed for two SemEval-2019 competition tasks: Task 5 hatEval "Multilingual detection of hate speech against immigrants and women in Twitter" Basile et al. (2019) (team name: LU Team) and Task 6 OffensEval "Identifying and categorizing offensive language in social media" (Zampieri et al., 2019b) (team name: NLPR@SRPOL). Table 1 shows the results that we achieved with our system in the SemEval-2019 competitions.

Competition    Placement
Task 6-A       8th position
Task 6-B       9th position
Task 6-C       2nd position
Task 5-A       8th position (ex aequo)

Table 1: Placement of our system in the SemEval-2019 competitions.

In order to create a highly accurate classifier, we combined state-of-the-art AI with linguistic findings on the pragmatic category of impoliteness (Culpeper, 2011; Jay and Janschewitz, 2008; Brown and Levinson, 1987). We achieved this by deciding on the factors that point to the impoliteness of a given expression (for the blacklists) or the entire sentence (for corpus annotation). Such factors led us to divide the blacklist into "offensive" and "offensive in context", as most linguistic studies of impoliteness focus on various aspects of the context. Furthermore, linguistic research made it possible to arrive at a maximally general definition of offensiveness for the crowdsourced annotators.
The article is organized as follows. Section 2 presents the current state of the art for offensive sentence classification. Section 3 explains the architecture of our system (features, models and ensembles). Section 4 describes the datasets and how they were created. Section 5 shows the results of the SemEval-2019 tasks in detail, motivating which combination of features and models was the best. Finally, Section 6 offers conclusions together with our plans for future research.

Related work
In recent years, the problem of recognizing whether a sentence is offensive has become an important topic in the machine learning literature. The problem takes different forms depending on the point of view. Currently there are three main areas of research on this topic in the machine/deep learning community:
1. Distinguishing offensive language from non-offensive language;
2. Solving biases in deep learning systems;
3. Recognizing more specific forms of offensiveness (e.g. racism, sexism etc.).
The main problem with each of the tasks is the amount of data available to researchers for experimenting with their systems. This, together with the fact that it is difficult to clearly define what is offensive/racist/sexist or not, makes the three problems listed above very difficult for a deep learning system to solve. Articles have shown that there is a strong bias in text and embeddings, and have tried to solve this bias using different techniques (Zhao et al., 2018; Bolukbasi et al., 2016). Furthermore, thanks to a dataset defined in Waseem and Hovy (2016) and Waseem (2016), various works have gone in the direction of recognizing sexism and racism in tweets (Pitsilis et al., 2018; Park and Fung, 2017).
Linguistic expertise enhanced the functionality at two stages: sentence annotation (described in detail in Section 4) and active creation of blacklists (Section 3). The completion of these tasks breaks new ground, as there exist no corpus linguistic studies on the generality of offensive language, to the best of our knowledge. Recent approaches of narrower scope are Dewaele (2015) and McEnery (2006).

System description
Our system is composed of three major elements, described below:
• Features: common to all models;
• Various models: neural networks or not;
• Ensemble.

Features
This section describes the features that we used and explains their role. We implemented the following features:
• Number of blacklisted words in the sentence;
• Number of special characters, uppercase characters, etc.;
• A language model taught to recognize offensive and non-offensive words.
Blacklisted We used two kinds of blacklisted expressions: "offensive" and "offensive in context". The "offensive in context" expressions are offensive in specific contexts and inoffensive otherwise, e.g. bloody or pearl necklace. This dictionary was compiled by crowdsourcing and contains about 2,300 words (+ variations). The blacklist consists of swear words, invectives, profanities, slurs and other impolite expressions.
Special characters, uppercase, etc. We checked the graphemic characteristics of the written text and gave them to the model as features. We mainly used the non-user-related features defined in Founta et al. (2018).
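As a rough illustration of the first two feature groups, the sketch below counts blacklisted tokens and a few graphemic properties of a sentence. The two tiny word lists, the function name `surface_features` and the exact set of counts are assumptions made for illustration; the actual blacklist and the feature set of Founta et al. (2018) are not reproduced here.

```python
import string

# Hypothetical, tiny stand-ins for the two blacklist tiers.
OFFENSIVE = {"idiot", "moron"}
OFFENSIVE_IN_CONTEXT = {"bloody"}


def surface_features(sentence):
    """Return simple count-based features for one sentence."""
    tokens = sentence.split()
    # Strip surrounding punctuation and lowercase before the lookup.
    words = [t.strip(string.punctuation).lower() for t in tokens]
    return {
        "n_offensive": sum(w in OFFENSIVE for w in words),
        "n_offensive_in_context": sum(w in OFFENSIVE_IN_CONTEXT for w in words),
        "n_uppercase": sum(c.isupper() for c in sentence),
        "n_special": sum(c in string.punctuation for c in sentence),
        "n_tokens": len(tokens),
    }


features = surface_features("You BLOODY idiot!!!")
```

Keeping the two blacklist tiers as separate counts lets a downstream model weigh context-dependent expressions differently from unconditionally offensive ones.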
Language model Inspired by the work of Yu et al. (2018), we decided to train a language model on both offensive and non-offensive words. For this purpose, we trained two character-based language models, one on the offensive dictionary (described above) and the other on a corpus of non-offensive words. After training them, we used the difference between the perplexities that the two models assign to each input word as a feature for the model.
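The perplexity-difference feature can be sketched as follows. This is a minimal illustration that assumes a character bigram model with add-one smoothing; the type of character-based language model actually used is not specified here, and the class name `CharBigramLM`, the toy word lists and the sign convention are all hypothetical.

```python
import math
from collections import Counter


class CharBigramLM:
    """Character bigram model with add-one smoothing (illustrative only)."""

    def __init__(self, words):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = {"$"}  # "$" marks the end of a word
        for word in words:
            chars = ["^"] + list(word) + ["$"]  # "^" marks the start
            self.vocab.update(chars[1:])
            for prev, cur in zip(chars, chars[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def perplexity(self, word):
        chars = ["^"] + list(word) + ["$"]
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        log_prob = 0.0
        for prev, cur in zip(chars, chars[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + v)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(chars) - 1))


def perplexity_difference(word, offensive_lm, neutral_lm):
    """Negative values mean the word looks more like the offensive list."""
    return offensive_lm.perplexity(word) - neutral_lm.perplexity(word)


offensive_lm = CharBigramLM(["idiot", "idiots", "moron"])
neutral_lm = CharBigramLM(["table", "house", "garden"])
diff = perplexity_difference("idiot", offensive_lm, neutral_lm)
```

Because the models work on characters, the feature remains defined for misspelled or obfuscated words that are missing from the word-level vocabulary.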

Models
We trained various models and then combined them in an ensemble. This section outlines the models that were part of the ensemble.
Embeddings Both the neural networks and the machine learning models used embeddings. We used the following embeddings: ELMo (Peters et al., 2018), fastText (Bojanowski et al., 2017), custom embeddings, and Universal Sentence Encoder (Cer et al., 2018). For fastText, we used the 1 million word (300d) vectors trained on Wikipedia 2017, below called fastText 1M. The custom embedding was built by training a fastText embedding on our corpus. We then combined the fastText 1M embeddings with these custom embeddings by concatenating their columns and applying Truncated SVD (inspired by the work of Speer et al., 2017). Building custom embeddings was important for offensive word classification because the original fastText 1M embeddings covered only around 50% of the words in the corpus, whereas after adding the custom embeddings only 30% of the words remained out of vocabulary. Below, this combination of embeddings is called "combined".
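The column-wise combination can be sketched as below, using NumPy's SVD. Several details are assumptions rather than a description of the actual procedure: the union vocabulary, the zero vector used for words missing from one source, the function name `combine_embeddings` and the toy vectors.

```python
import numpy as np


def combine_embeddings(emb_a, emb_b, dim):
    """Concatenate two word->vector maps column-wise and reduce with SVD.

    Words missing from one map get a zero vector there (an assumption).
    """
    dim_a = len(next(iter(emb_a.values())))
    dim_b = len(next(iter(emb_b.values())))
    vocab = sorted(set(emb_a) | set(emb_b))
    matrix = np.zeros((len(vocab), dim_a + dim_b))
    for row, word in enumerate(vocab):
        if word in emb_a:
            matrix[row, :dim_a] = emb_a[word]
        if word in emb_b:
            matrix[row, dim_a:] = emb_b[word]
    # Truncated SVD: keep the `dim` largest singular directions.
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    reduced = u[:, :dim] * s[:dim]
    return {word: reduced[row] for row, word in enumerate(vocab)}


general = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
custom = {"dog": [1.0, 1.0], "n00b": [0.0, 1.0]}
combined = combine_embeddings(general, custom, dim=2)
```

The reduced vocabulary is the union of both sources, which is what lowers the out-of-vocabulary rate.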
For both neural network models (LSTM and Transformer), we used multi-head attention and tried different embeddings. In most cases, the Transformer models had better results than the LSTM models, and this is what we used in the submissions. The parameters of the models are described in Appendix A.1. In both models, the Features described in Section 3.1 are concatenated to the output of the model.
OpenAI GPT One of the models that we used was the OpenAI GPT (Radford et al., 2018). We used the GPT model in its original form, without changing any parameters. Our results show that this model works very well when there is enough data for fine-tuning. However, small classes, as in Task 6 Subtask C, pose a problem (see Section 5).
Machine learning models We used two machine learning models:
• Random forest;
• SVM.
For these models, we built a pipeline in which:
• In a first step, we either compute the embeddings of the sentences or get the TF-IDF score.
• In a second step, we concatenate the result of the first step with the Features described in Section 3.1 (if used).
• We run the classifier.
As embeddings we used only the Universal Sentence Encoder, which gave good results.
Ensemble For the ensemble we used a voting classifier with soft voting (based on the probability returned by each model). For each subtask, we show which combination of models gave the best results.
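Soft voting can be sketched in a few lines. The function name `soft_vote`, the optional `weights` argument and the example probabilities are assumptions for illustration, not the values used in the submissions.

```python
def soft_vote(probabilities, weights=None):
    """Average per-class probabilities from several models and pick the argmax.

    probabilities: list of dicts, one per model, mapping label -> probability.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    labels = probabilities[0].keys()
    averaged = {
        label: sum(w * p[label] for w, p in zip(weights, probabilities)) / total
        for label in labels
    }
    return max(averaged, key=averaged.get), averaged


label, scores = soft_vote([
    {"OFF": 0.90, "NOT": 0.10},  # e.g. a fine-tuned GPT
    {"OFF": 0.40, "NOT": 0.60},  # e.g. a Random forest
    {"OFF": 0.45, "NOT": 0.55},  # e.g. a Transformer
])
# label == "OFF": one confident model outweighs two hesitant ones
```

In this example a majority (hard) vote would return NOT, which shows how soft voting lets a confident model override less certain ones.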
The pipeline for the entire offensive sentence classifier is shown in Figure 1.

Preprocessing
Preprocessing plays a crucial role in the analysis of potentially offensive sentences, because most inputs use highly non-standard language. Hence, preprocessing was mainly focused on normalizing the language to simplify the task of the models. We applied the following preprocessing:
• Substituting user names with <USER> tokens;
• Removing all links;
• Normalizing words and letters;
• Normalizing spacing and non-standard characters;
• Over/downsampling of the classes.
After the preprocessing, we split by space and used each token as an input to the models.
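The user-name, link and spacing steps listed above can be sketched with regular expressions. The patterns below are assumptions (Twitter-style `@name` mentions, `http(s)://` and `www.` links) and not the exact rules used in the system.

```python
import re

USER_RE = re.compile(r"@\w+")                      # assumed mention format
LINK_RE = re.compile(r"https?://\S+|www\.\S+")     # assumed link format
SPACE_RE = re.compile(r"\s+")


def preprocess(text):
    """Drop links, replace user names, normalize spacing, split on spaces."""
    text = LINK_RE.sub(" ", text)
    text = USER_RE.sub("<USER>", text)
    text = SPACE_RE.sub(" ", text).strip()
    return text.split(" ")


tokens = preprocess("@john_doe   look at this https://t.co/abc  now")
# tokens == ["<USER>", "look", "at", "this", "now"]
```

Links are removed before user names are substituted so that an `@` inside a URL is not turned into a `<USER>` token.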

Normalizing words and letters
We have a dictionary containing common spelling variants of words found in our corpus. We used this to change words to the "canonical" form. Examples of such variants can be seen in Table 2.
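A dictionary lookup of this kind might look like the sketch below. The entries in `VARIANTS` are invented examples, not entries from our dictionary, and the case-insensitive lookup is an assumption.

```python
# Hypothetical variant -> canonical mapping, for illustration only.
VARIANTS = {"u": "you", "ur": "your", "b4": "before"}


def normalize_tokens(tokens, variants=VARIANTS):
    """Map each token to its canonical form when a variant is known."""
    return [variants.get(tok.lower(), tok) for tok in tokens]


normalized = normalize_tokens(["U", "are", "late"])
# normalized == ["you", "are", "late"]
```

Tokens without a known variant pass through unchanged, so the step never removes information that the models could still use.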

Datasets
In this section we give a high-level overview of the datasets we used for training our models: our own custom-built corpora and the datasets provided by the SemEval organizers.
From the sources listed above, we added a total of 20,399 sentences to the SemEval-2019 corpus for Task 5, and 97,759 sentences to the one for Task 6.
Custom Offensive language corpus Our custom dataset was built by crowdsourcing and by crawling content from the Internet. The dataset is balanced, with 49,179 not offensive and 48,580 offensive comments. Around half of the dataset was labeled by linguists, who were asked to look for "general offensiveness". This could take various forms. To each sentence, the linguists assigned one of three labels:
• OFF: offensive sentence,
• NOT: not offensive sentence,
• Nonsense: random collection of words or non-English (removed from the corpus).
In cases of disagreement between linguists, we chose the most popular label, if applicable, or obtained an expert annotation. We calculated Fleiss' kappa for inter-annotator agreement (Fleiss, 1971), which extends Cohen's kappa to more than two raters (Cohen, 1960). For random ratings Fleiss' κ = 0, while for perfect agreement κ = 1. Our κ was equal to 0.62, which falls in the "substantial agreement" category, according to Landis and Koch (1977).
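For reference, Fleiss' kappa can be computed from a table of per-item label counts as in the sketch below. The function name and the toy rating tables are illustrative; they are not our annotation data.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters who assigned item i to category j.

    Every item must be rated by the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)


# Columns: OFF, NOT, Nonsense; three raters per sentence (toy data).
perfect = fleiss_kappa([[3, 0, 0], [0, 3, 0], [0, 0, 3]])
```

With full agreement on every item the observed agreement is 1, so the statistic reaches its maximum of 1 regardless of the label distribution.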
The remaining part of the corpus was assessed automatically with a blacklist-based filter.
Dataset for Task 6 The OLID dataset (Zampieri et al., 2019a) contains Offensive and Not Offensive sentences. The Offensive sentences are further categorized into:
• TIN: targeted insults and threats,
• UNT: untargeted.
The targeted (TIN) category is further subdivided into:
• IND: individual target,
• GRP: group target,
• OTH: a target that is neither an individual nor a group.
Our full offensive language corpus, described in the previous subsection, was used for this task. The OFF sentences were further annotated for the two categories while the NOT sentences were not further annotated. All the additional classes were added automatically by a wordlist-based annotator.
Dataset for Task 5 The dataset for Task 5 (Basile et al., 2019) contained the following classes, together with other subclasses:
• HATE: hate speech against women or immigrants,
• NOHATE: no hate speech against women or immigrants.
Given that we participated in Task 5 Subtask A, we annotated our corpus only with these two labels. Using a mixture of automated and manual annotation, we were able to add around 30k sentences from our dataset for this task.

Results
SemEval In Table 3 we show the average F1 of our models for all the SemEval-2019 Tasks and Subtasks. These results were obtained by using an ensemble of models, and in Table 4 we detail the ensembles. Given the short amount of time available during the SemEval competition, we were unable to test all the combinations of models and data preparation types to choose the best combination for the ensemble. We thus selected the models in the ensemble by experimenting with only a subset of the models.
Models      6-A   6-B   6-C   5-A
GPT *
RF
RF+U
T+EL
T+CO+U *
T+EL+U+F

Table 4: Ensemble detail. The models marked with * have been trained with an unbalanced dataset.

This is the main reason why only one model used in Task 5 contains additional features (the T+EL+U+F model).
After the competition, we tried the models contained in the ensembles on all the tasks; detailed results are presented in Table 9 in the Appendix. It is important to note, though, that the results in the Appendix cannot be directly compared with those of the SemEval competition because, although the models were the same, the test data was different (the gold data has not been released yet).
From the results we clearly see that there are two "data regimes": in the low data regime (Task 6 Subtasks B and C), Random forest (with or without the Universal embeddings) is the best choice. However, in the big(ger) data regime, Fine tune is the best model. Also, in the low data regime each model works best with a different data preparation strategy: GPT with unbalanced data, the Transformer with oversampled and downsampled data, and Random forest with oversampled data.
Ablation studies In this part we show the results of ablation studies on the Transformer and Random forest models. In this study, we want to understand how far the final result was influenced by the linguistically based features and preprocessing we defined in this article. All the results in this section have been computed on a test set created from the Train set shared in the SemEval-2019 tasks (as in Appendix A.3). As we discussed in the previous section, the Tasks were characterized by a "low" (Task 6 C) and a "big" data regime; the results are given in Table 5. The table shows that, in the big data regime, the Random forest works best when only the Universal Sentence Encoder is used, while the Transformer model improves its performance when the features are added. On the other hand, in the low data regime, we see that the plain Random forest outperforms all the other combinations. This is probably because the more inputs we add, the more the model needs to learn, and with little data this is simply not possible.
In a second study we wanted to understand how much the normalization defined in Section 4.1 affected the performance of the model. For this reason, we retrained the best models in Table 5 for both Subtasks on an unnormalized version of the dataset. For Subtask A, the F1 of the T+CO+F model decreased from 0.75 to 0.73, while for Subtask C, the F1 of the RF model decreased from 0.54 to 0.44.
The results of this section seem to point to the fact that the features we added and the normalization we used are beneficial for the performance of the models. Further work will be devoted to understanding this point though.

Conclusions
The article presented our approach to building a classifier that recognizes offensive expressions in text. It showed how our architecture is suitable for multiple (related) offensive sentence classification tasks. It also showed how we built the features and the data that the model used for learning. Thanks to our system, we were 2nd in SemEval-2019 Task 6 Subtask C. In the article we also showed with ablation studies that the proposed linguistic features and the added embeddings improve the performance of the models we used.
In the future, we will extend our system to recognize a wider set of features. We are currently working on analyzing the linguistic differences between the offensive corpus and the non-offensive corpus. Specifically, we think that by analyzing the differences, we should be able to build a "white-list" of terms that can be used as features that will help the classifier understand which sentences are less likely to be offensive.

A.2 Data
In Table 6 we show the amount of data that was contained in our corpus (overall). In Table 7

A.3 Model results
Class   Targeting   Target   Total
OFF     TIN         IND      18,506
OFF     TIN         GRP      6,761
OFF     TIN         OTH      1,025
OFF     TIN         Total    34,669
OFF     UNT         -        6,234
OFF     Total       -        59,837
NOT     -           -        64,773

In this section we show the detailed results of all the models for all the SemEval-2019 tasks. For each Task, we extracted a test set from the Train data released by SemEval. We compared the models to one of the current state-of-the-art models, defined in Park and Fung (2017); the results shown here are obtained by averaging the best F1 for each class (not a single model). The data by Waseem and Hovy (2016) for comparing to the state-of-the-art model has been kindly shared by the authors of Park and Fung (2017). In the table we marked the data preparation as follows:
• No additional mark: the normalized data with oversampling and downsampling as described in Section 4.
• FULL: the normalized data with oversampling but no downsampling.
• UNB: the normalized data without oversampling or downsampling.
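The three data preparation variants (default, FULL and UNB) can be sketched with random resampling. The single `target` size per class, the function name `resample` and the use of sampling with replacement for oversampling are assumptions; the exact procedure is not specified here.

```python
import random


def resample(examples_by_class, target, oversample=True, downsample=True, seed=0):
    """Bring each class closer to `target` examples.

    oversample=True,  downsample=True   -> default preparation
    oversample=True,  downsample=False  -> "FULL"
    oversample=False, downsample=False  -> "UNB"
    """
    rng = random.Random(seed)
    result = {}
    for label, examples in examples_by_class.items():
        items = list(examples)
        if len(items) > target and downsample:
            items = rng.sample(items, target)  # without replacement
        elif len(items) < target and oversample:
            extra = rng.choices(items, k=target - len(items))  # with replacement
            items = items + extra
        result[label] = items
    return result


data = {"IND": list(range(10)), "OTH": list(range(3))}
default = resample(data, target=5)
full = resample(data, target=5, downsample=False)
unb = resample(data, target=5, oversample=False, downsample=False)
```

Only the training split should be resampled in this way, so that the evaluation still reflects the original class distribution.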
The model acronyms are the same as the ones used in Section 5.
Model   5-A   6-A   6-B   6-B FULL   6-C   6-C FULL

Table 9: Macro F1 for all the models on all the Tasks and on the state-of-the-art (SOTA) data.
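For clarity, macro F1 is the unweighted mean of the per-class F1 scores. The sketch below computes it from gold and predicted labels; the function name and the toy labels are illustrative and are not taken from our evaluation.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


score = macro_f1(["OFF", "OFF", "NOT", "NOT"], ["OFF", "NOT", "NOT", "NOT"])
```

Because every class contributes equally, small classes such as OTH in Subtask C weigh as much as large ones, which is why the low data regime has such a strong effect on this metric.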