ConvAI at SemEval-2019 Task 6: Offensive Language Identification and Categorization with Perspective and BERT

This paper presents the application of two strong baseline systems for toxicity detection and evaluates their performance in identifying and categorizing offensive language in social media. PERSPECTIVE is an API that serves multiple machine learning models for the improvement of online conversations, including a toxicity detection system trained on a wide variety of comments from platforms across the Internet. BERT is a recently popular language representation model, fine-tuned per task and achieving state-of-the-art performance in multiple NLP tasks. PERSPECTIVE performed better than BERT at detecting toxicity, while BERT was much better at categorizing the offense type. Both baselines ranked surprisingly high in the SEMEVAL-2019 OFFENSEVAL competition: PERSPECTIVE 12th in detecting an offensive post and BERT 11th in categorizing it. The main contribution of this paper is the assessment of two strong baselines for the identification (PERSPECTIVE) and the categorization (BERT) of offensive language with little or no additional training data.


Introduction
Offensive language detection refers to computational approaches for detecting abusive language, such as threats, insults, defamation, discrimination, and swearing (Pavlopoulos et al., 2017b), which may be targeted (at an individual or group) or untargeted (Waseem et al., 2017). These computational approaches are often used by moderators who face an increasing volume of abusive content and would like assistance in managing it efficiently (see, for example, https://goo.gl/VQNDNX). Although offensive language detection is not a new task (Dinakar et al., 2011; Dadvar et al., 2013; Kwok and Wang, 2013; Burnap and Williams, 2015; Tulkens et al., 2016), the creation of large corpora (Wulczyn et al., 2017), along with recent advances in pre-training text representations (Devlin et al., 2018), allows for much more efficient approaches. Furthermore, while new competitions and corpora are being introduced (Zampieri et al., 2019a), there is a need for strong baselines against which to assess the performance of more complex systems. This paper assesses two systems for the detection and categorization of offensive language, both of which require few or no task-specific annotated training instances.
The first baseline is a Convolutional Neural Network (CNN) for toxicity detection, trained on millions of user comments from different online publishers and made publicly available through the Perspective API. This model requires no extra training or fine-tuning and can be applied directly to score unseen posts. The second strong baseline is the recently popular Bidirectional Encoder Representations from Transformers (BERT), a pre-trained model that has been reported to achieve state-of-the-art performance in multiple NLP tasks with limited fine-tuning on task-specific training data (Devlin et al., 2018).
Section 2 below summarizes related work and Section 3 discusses the SEMEVAL-2019 OFFENSEVAL dataset we used. In Section 4 we describe the two proposed baselines, and we report experimental results in Section 5. Section 6 concludes and suggests future directions.

Related Work
Various forms of offensive language detection have recently attracted a lot of attention (Nobata et al., 2016; Pavlopoulos et al., 2017b; Park and Fung, 2017; Wulczyn et al., 2017). Apart from the growing volume of popular press concerning toxicity online, the increased research interest in offensive language is partly due to the recent Workshops on Abusive Language Online, as well as other fora, such as GermEval for German texts, TA-COS, and TRAC (Kumar et al., 2018). The literature contains many terms for different kinds of offensive language: toxic, abusive, hateful, attacking, etc. Largely, these are defined by different survey methods. Waseem et al. (2017) divide abusive language into explicit vs. implicit, and directed vs. generalized. Other researchers have created different taxonomies based on sub-kinds of toxic language (Table 2).
Although some previous research has considered several types of abuse and their relations, detecting varieties of hate has attracted more attention (Djuric et al., 2015; Malmasi and Zampieri, 2017; ElSherief et al., 2018; Gambäck and Sikdar, 2017; Zhang et al., 2018). The first publicly available dataset for hate speech detection was that of Waseem and Hovy (2016), containing 1,607 English tweets annotated for sexism and racism. A larger dataset of approx. 25K tweets, collected using a hate lexicon, was published later. Despite the popularity of hate speech detection in the literature, no larger publicly available hate speech datasets seem to exist. For recent overviews of hate speech detection, consult Schmidt and Wiegand (2017) and Fortuna and Nunes (2018).
Research into the various kinds of offensive language detection is mainly focused on English, but some work in other languages also exists. Work on a large dataset of Greek moderated news portal comments is presented by Pavlopoulos et al. (2017a). A dataset of obscene and offensive user comments and words in Arabic social media was presented by Mubarak et al. (2017). Previous work includes a system to detect and rephrase profanity in Chinese (Su et al., 2017), and an annotation schema for unacceptable social media content in Slovene (Fišer et al., 2017).

Data
The SEMEVAL-2019 OFFENSEVAL dataset that is available to participants contains 13,240 tweets; the counts of the labels are shown in Table 1. The OFFENSEVAL task consists of three subtasks, described in detail by Zampieri et al. (2019b). Subtask A aims at the detection of offensive language (OFF or NOT in Table 3). Subtask B aims at categorizing offensive language as targeting a specific entity (TIN) or not (UNT). Subtask C aims to identify whether the target of an offensive post is an individual (IND), a group (GRP), or other (OTH). Table 1 also shows the size of the vocabulary per class (label), which, unsurprisingly, is proportional to the class size. It is worth noting that offensive tweets targeting a group are the lengthiest, with 28 tokens on average (see Table 1, Subtask C, GRP).
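Per-class statistics of the kind reported in Table 1 (class size, vocabulary size, average token count) are straightforward to compute. The sketch below is our own illustration on toy data with whitespace tokenization; it is not the actual OFFENSEVAL preprocessing, whose tokenizer may differ.

```python
from collections import defaultdict

def per_class_stats(tweets):
    """Compute per-class count, vocabulary size, and average token length
    from (text, label) pairs, using simple whitespace tokenization."""
    counts = defaultdict(int)     # tweets per class
    vocab = defaultdict(set)      # distinct tokens per class
    tokens = defaultdict(int)     # total tokens per class
    for text, label in tweets:
        toks = text.lower().split()
        counts[label] += 1
        vocab[label].update(toks)
        tokens[label] += len(toks)
    return {lbl: {"size": counts[lbl],
                  "vocab": len(vocab[lbl]),
                  "avg_tokens": tokens[lbl] / counts[lbl]}
            for lbl in counts}
```

Running this over the labeled tweets of each subtask yields the class-size and vocabulary columns of a table like Table 1.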

Baselines
We now describe the two baselines (Perspective, BERT) that we implemented and evaluated.

Perspective
We employed the Perspective API, which was created by Jigsaw and Google's Counter Abuse Technology team in Conversation-AI, to facilitate better conversations online and protect voices in conversations (Hosseini et al., 2017). Although open-source code is available, we chose to use the pre-trained models accessible through the API. For offensive language detection in Subtask A, we used the Toxicity model, which is a CNN based on GLOVE word embeddings, trained over millions of user comments from publishers such as the New York Times and Wikipedia. This is a robust model, which we expect to be somewhat adaptable to different datasets (and their labels for closely related forms of offensive language), such as the offensive tweets of OFFENSEVAL. For offensive language categorization in Subtask B, we employed other experimental models, also available via the Perspective API, which detect various abuse types.

BERT

BERT is a language representation model based on the Transformer architecture. It is pre-trained to predict (a) a masked word from its left and right context, and (b) the next sentence. We used the publicly available BERT-BASE version (https://goo.gl/95mqhE), with 12 Transformer layers and a hidden state size of 768, which is pre-trained on a monolingual corpus of 3.3B words. For a particular NLP task, a task-specific layer is added on top of BERT.
In our case, the extra layer comprises dropout, a linear transformation, and softmax, using default values for all hyper-parameters. During the task-specific 'fine-tuning', the extra layer is trained jointly with BERT (refining the pre-trained BERT model) on task-specific data. Previous research has demonstrated that fine-tuning BERT leads to state-of-the-art performance in several NLP tasks (Devlin et al., 2018).
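As a rough illustration of this task-specific layer, the sketch below implements a dropout → linear → softmax head over BERT's pooled representation in plain NumPy. The dimensions follow BERT-BASE, but the parameter names and initialization are our own assumptions for illustration, not the paper's implementation (where the head is trained jointly with BERT).

```python
import numpy as np

HIDDEN, CLASSES = 768, 2          # BERT-BASE hidden size; e.g. OFF vs NOT
rng = np.random.default_rng(0)

# Hypothetical head parameters; in the actual system they are learned
# jointly with BERT during fine-tuning.
W = rng.normal(scale=0.02, size=(HIDDEN, CLASSES))
b = np.zeros(CLASSES)

def classification_head(pooled, dropout_mask=None):
    """Dropout -> linear -> softmax over BERT's pooled representation.

    `pooled` has shape (batch, HIDDEN); `dropout_mask` is a 0/1 mask applied
    only at training time (None disables dropout, as at inference)."""
    h = pooled if dropout_mask is None else pooled * dropout_mask
    logits = h @ W + b
    # Numerically stable softmax over the class dimension.
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)
```

At inference, the predicted label is simply the argmax of the returned class probabilities.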

Offensive Language Detection
For Subtask A, we used the toxicity score from Perspective and returned the offensive label (OFF) when the returned score was above 0.5. No fine-tuning was performed for Perspective. For BERT, we split the dataset into training (10K tweets) and development (3,240 tweets) subsets, and fine-tuned BERT for 3 epochs, using the uncased model with batch size 32, based on preliminary experiments. In this subtask, Perspective outperformed BERT and was ranked 12th out of 103 submissions. The difference from the top-ranked model was 3.5 F1 points. The performance of Perspective in this subtask is particularly interesting, considering that the training data for its models were not labeled for offensiveness, but rather for other attributes such as toxicity, threats, and insults (https://goo.gl/Bmiogb). Ignoring Perspective, BERT was ranked 27th. As shown in Table 3, both of our strong baselines outperform the naive majority baselines for this subtask. The confusion matrix of Perspective is shown in Fig. 1. Both recall and precision are high for the NOT label (87.96% and 89.81%), but lower for OFF (68.33% and 71.62%). This is explained by the fact that NOT is twice the size of OFF (Table 1). Since no fine-tuning was performed for Perspective, we also used it to score the training data. Macro F1 was 78.01% (85.02% for NOT, 71% for OFF) and accuracy was 80.24%, which are lower than, but close to, the respective values on the test data (Table 3).
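The Subtask A decision rule is a simple threshold on Perspective's toxicity score, and macro F1 is the unweighted mean of the per-class F1 scores. A minimal sketch (the 0.5 threshold comes from the description above; the macro-F1 helper is our own illustration of the reported metric):

```python
def label_from_toxicity(score, threshold=0.5):
    """Map a Perspective TOXICITY score in [0, 1] to a Subtask A label."""
    return "OFF" if score > threshold else "NOT"

def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(pr_pairs):
    """Unweighted mean of per-class F1, given (precision, recall) per class."""
    return sum(f1(p, r) for p, r in pr_pairs) / len(pr_pairs)
```

Plugging the per-class precision/recall figures above into `macro_f1` reproduces the kind of macro-averaged score reported in Table 3.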

Offense Type Detection
For Subtask B, we used the experimental insult, threat, and attack-on-commenter models of Perspective. We averaged the insult and attack-on-commenter scores and compared this average with the threat score; the Perspective baseline returned a targeted insult/threat (TIN) when the average was greater, and untargeted (UNT) otherwise. The BERT baseline was fine-tuned on the entire dataset that was available to participants, because we considered that dataset too small for a training/development split. BERT clearly outperformed the Perspective baseline (Table 4) and ranked 11th in this subtask among 73 participants, whereas the best system achieved 7.8 F1 points more. The confusion matrix of BERT for this subtask is shown in Fig. 2. The large class imbalance (TIN tweets are approx. 7 times as many as UNT; see Table 1) significantly reduces both the recall (44.44%) and precision (42.86%) of BERT for the UNT class, compared to TIN (92.49% and 92.92%, respectively).
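The Perspective-based rule for Subtask B, as we read it from the description above, can be sketched as follows (function and parameter names are ours):

```python
def categorize_offense(insult, attack_on_commenter, threat):
    """Subtask B rule: return TIN (targeted insult/threat) when the mean of
    the insult and attack-on-commenter scores exceeds the threat score,
    and UNT (untargeted) otherwise. All scores are in [0, 1]."""
    average = (insult + attack_on_commenter) / 2
    return "TIN" if average > threat else "UNT"
```

The three inputs are the scores returned by the corresponding experimental Perspective models for a given post.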

Offense Target Detection
For Subtask C, Perspective has no suitable model yet, and the BERT-based systems we submitted were still experimental, due to time constraints. We therefore consider our results for this subtask not indicative, and leave the development and evaluation of baselines for this subtask as future work.

Conclusion
This paper proposed and evaluated two strong baselines, based on the Perspective API and BERT, for identifying and categorizing offensive language in social media. The baselines require few (BERT) or no (Perspective) additional task-specific training data, and this is the first work, to our knowledge, to assess their performance on the tasks we considered. The Perspective-based baseline was ranked 12th among 103 submissions for the task of classifying a post as offensive or not. The BERT baseline was ranked 11th among 73 submissions for the task of recognizing whether an offensive post is targeted or not. Both baselines ranked surprisingly high in the corresponding tasks, considering that they were given no or few additional task-specific training instances, respectively. Furthermore, the Perspective baseline, which required no fine-tuning, outperformed BERT by a large margin in the task of detecting offensive language. In future work, we intend to examine stronger yet easy-to-apply baselines, and to release source code that makes them easier to use.