NULI at SemEval-2019 Task 6: Transfer Learning for Offensive Language Detection using Bidirectional Transformers

Transfer learning and domain-adaptive learning have been applied to various fields, including computer vision (e.g., image recognition) and natural language processing (e.g., text classification). One benefit of transfer learning is the ability to learn effectively and efficiently from limited labeled data using a pre-trained model. For the shared task of identifying and categorizing offensive language in social media, we preprocess the dataset according to the language behaviors observed on social media, and then adapt and fine-tune the Bidirectional Encoder Representations from Transformers (BERT) model pre-trained by the Google AI Language team. Our team NULI won first place in Sub-task A (Offensive Language Identification) and ranked 4th in Sub-task B (Automatic Categorization of Offense Types) and 18th in Sub-task C (Offense Target Identification).


Introduction
Anti-social online behaviors, including cyberbullying, trolling, and offensive language (Xu et al., 2012; Kwok and Wang, 2013; Cheng et al., 2017), are attracting growing attention across social networks. Intervention against such behaviors should happen at the earliest opportunity. Automatic offensive language detection using machine learning has emerged as one solution for identifying such hostility and has shown promising performance.
In SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (Zampieri et al., 2019b), the organizers collected tweets through the Twitter API and annotated them hierarchically with respect to offensive language, offense type, and offense target. The task is divided into three sub-tasks: a) detecting whether a post is offensive (OFF) or not (NOT); b) identifying the offense type of an offensive post as targeted insult (TIN), targeted threat (TTH), or untargeted (UNT); c) for a post labeled TIN/TTH in sub-task B, identifying the target of the offense as individual (IND), group of people (GRP), organization or entity (ORG), or other (OTH). The three sub-tasks are evaluated independently with the macro-F1 metric.
The challenges of this shared task include: a) the comparatively small dataset makes it hard to train complex models; b) the characteristics of language on social media pose difficulties such as out-of-vocabulary words and ungrammatical sentences; c) the distribution of target classes is imbalanced and inconsistent between training and test data. To address out-of-vocabulary words, especially emoji and hashtags, we preprocess each tweet by interpreting emoji as meaningful English phrases and segmenting hashtags into space-separated words. The classifiers we experiment with include: a linear model with word unigram, word2vec, and Hatebase features; a word-based Long Short-Term Memory (LSTM) network; and a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2018). We choose BERT for our official submission, since it performs best in our experiments.
The rest of this paper is organized as follows: related work on hostility in social media is reviewed in section 2; section 3 introduces the data, the details of preprocessing, and the methodology of our models; experimental results are discussed in section 4. We present the conclusions of our work at the end of the paper.


Related Work
Schmidt and Wiegand (2017) surveyed features widely used for hate speech detection, including simple surface features, word generalization, knowledge-based features, etc. Other work reported hate speech detection results using word n-grams and a sentiment lexicon, and provided insights on misclassified examples. A typology of abusive language sub-tasks is proposed in (Waseem et al., 2017). Liu et al. (2018) discuss forecasting future hostility on Instagram at two levels: presence and intensity. Beyond English, researchers have also investigated offensive language detection for Chinese (Su et al., 2017) and Slovene (Fišer et al., 2017). In the shared task on aggression identification organised as part of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-1) at COLING 2018, word/character n-grams and word embeddings were the most commonly used features among participants, and the most popular classifiers were SVM, LSTM, and RNN. The best-performing system employed a bidirectional LSTM over GloVe embeddings.

Data Description
The Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019a) was collected through the Twitter API by searching for a set of keywords. The keywords include target-neutral phrases such as 'she is', 'he is', and 'you are', which nevertheless yield a high proportion of offensive tweets. The overall proportion of offensive tweets was kept at around 30% through different sampling methods. Another observation reported in the paper is that political tweets, collected with keywords such as 'MAGA', 'liberal', and 'conservative', tend to be more likely offensive.
The main task of this competition is decomposed into three levels according to the hierarchical annotation: a) Offensive Language Detection, b) Categorization of Offensive Language, and c) Offensive Language Target Identification. All three sub-tasks share the same dataset; each level's data is a subset of the previous level's.
The dataset is released in three parts: the starting kit, the training set, and the test set. Table 1 summarizes the class distribution. The table shows that the distributions of the three splits differ somewhat, which is to be expected with real-world data but makes the tasks harder.

    Class | Starting kit | Training | Test
    NOT   |     243      |   8840   |  620
    OFF   |      77      |   4400   |  280
    TIN   |      38      |   3876   |  213
    UNT   |      39      |    524   |   27
    IND   |      30      |   2407   |  100
    GRP   |       7      |   1074   |   78
    OTH   |       2      |    395   |   35

Table 1: Data distribution. The first two rows give the class distribution for sub-task A, the middle two rows for sub-task B, and the last three rows for sub-task C.

Preprocessing
Emoji substitution We use an open-source emoji project on GitHub 2 that maps each emoji's Unicode codepoint to a descriptive English phrase. We treat these phrases as regular English text, so the emoji's semantic meaning is preserved, which is especially helpful given the limited dataset size.
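As a sketch, the substitution step is a straightforward lookup; `EMOJI_MAP` below is a tiny hypothetical excerpt for illustration, not the full mapping of the GitHub project referenced above:

```python
# Minimal sketch of emoji-to-phrase substitution.
# EMOJI_MAP is a small hypothetical excerpt; the actual project
# covers the full emoji set.
EMOJI_MAP = {
    "\U0001F600": " grinning face ",
    "\U0001F621": " pouting face ",
    "\U0001F4A9": " pile of poo ",
}

def substitute_emoji(text: str) -> str:
    """Replace each known emoji with its English phrase."""
    for codepoint, phrase in EMOJI_MAP.items():
        text = text.replace(codepoint, phrase)
    # Collapse any doubled spaces introduced by the substitution.
    return " ".join(text.split())
```

The surrounding spaces in each phrase keep the substituted words from fusing with neighboring tokens.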
HashTag segmentation Hashtags have become popular across social networks, including Twitter, Instagram, and Facebook. To detect whether a hashtag contains profanity, we apply word segmentation using an open-source tool on GitHub 3 . A typical example: '#LunaticLeft' is segmented into 'Lunatic Left', which is clearly offensive in this context.
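A simplified sketch of the idea for CamelCase hashtags (the actual tool referenced above uses dictionary-based segmentation and also handles all-lowercase hashtags, which this regex cannot):

```python
import re

def segment_hashtag(tag: str) -> str:
    """Split a CamelCase hashtag into space-separated words.

    Simplified sketch: inserts a space before each capital letter
    that follows a lowercase letter or digit.
    """
    body = tag.lstrip("#")
    # "LunaticLeft" -> "Lunatic Left"
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", body)
```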

Misc.
We also convert all text to lower case. 'URL' is substituted with 'http', since 'URL' has no representation in some pre-trained embeddings and models. Runs of consecutive '@USER' tokens are capped at three to reduce redundancy.
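The three steps above can be sketched as a single token-level pass (a minimal illustration, not the exact implementation):

```python
def normalize(text: str) -> str:
    """Lower-case, map the 'URL' placeholder to 'http', and cap
    runs of consecutive '@user' mentions at three."""
    tokens = text.lower().split()
    tokens = ["http" if t == "url" else t for t in tokens]
    out, run = [], 0
    for t in tokens:
        if t == "@user":
            run += 1
            if run > 3:
                continue  # drop mentions beyond the third in a row
        else:
            run = 0
        out.append(t)
    return " ".join(out)
```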

Methodology
Linear model We first select Logistic Regression as our baseline model to establish a lower bound on performance. First, we cross-validate the hyper-parameters of different vectorizers to build bag-of-words representations. Secondly, we adopt the pre-trained word2vec model from Google 4 and aggregate the maximum and average value of the word vectors in each dimension.
Thirdly, we use the Hatebase API 5 to count hate words in each category. We validate all feature combinations, then report the highest accuracy and F1 to select the model parameters.
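The embedding features above (element-wise maximum and average of a tweet's word vectors) can be sketched as follows; `TOY_VECTORS` is a hypothetical 2-dimensional stand-in for the pre-trained word2vec model:

```python
# Hypothetical toy embeddings standing in for pre-trained word2vec.
TOY_VECTORS = {
    "lunatic": [1.0, -0.5],
    "left":    [0.5,  0.25],
}
DIM = 2  # toy embedding dimension

def embed_features(tokens):
    """Concatenate the per-dimension max and average of the
    word vectors found in the tweet."""
    vecs = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not vecs:
        return [0.0] * (2 * DIM)
    dims = list(zip(*vecs))  # group values by dimension
    max_pool = [max(d) for d in dims]
    avg_pool = [sum(d) / len(d) for d in dims]
    return max_pool + avg_pool
```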
LSTM Long Short-Term Memory (Hochreiter and Schmidhuber, 1997) is a more powerful extension of the recurrent neural network. The gates inside the LSTM mitigate the vanishing-gradient problem, allowing it to memorize long-range dependencies. LSTMs have been used in many natural language processing tasks, such as sentiment classification, neural machine translation, and language generation. We use an LSTM as our second, stronger baseline model. The specific setting is as follows: the input tokens are mapped from a one-hot encoding into a shared embedding layer of dimension 140; the LSTM has 64 hidden units and is followed by a dropout layer with rate 0.5. The maximum sequence length is 140, so sentences are either truncated or padded.
BERT The Google research team released Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018; code at https://github.com/google-research/bert), achieving state-of-the-art results on many NLP tasks. BERT uses the multi-head Transformer architecture introduced in (Vaswani et al., 2017). The model is pre-trained on large corpora drawn from different sources.
Since the dataset in SemEval-2019 Task 6 is relatively small, we fine-tune the pre-trained BERT model on it and record the loss and accuracy at each epoch. We observe that the model converges quickly: after the first or second epoch it already reaches a low loss on the validation set. For sub-tasks B and C we therefore report the macro-F1 score after training for 1, 2, and 3 epochs.
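The checkpoint-selection rule this implies (keep the epoch with the lowest validation loss) is simply:

```python
def select_epoch(val_losses):
    """Return the 1-indexed epoch with the lowest validation loss,
    given one loss value per completed epoch."""
    best = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best + 1

# Hypothetical per-epoch validation losses for epochs 1-3.
losses = [0.412, 0.387, 0.455]
```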

Experiment Results
The evaluation metric of this task is macro-F1, the unweighted average of the per-class F1 scores. The imbalanced class distribution makes a high macro-F1 hard to achieve, since the score is penalized by poor performance on minority classes. Weighting the loss during training is one way to keep the model from defaulting to majority-class predictions.
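Concretely, the metric averages per-class F1 without weighting by class frequency, so a classifier that predicts only the majority class is heavily penalized:

```python
def macro_f1(gold, pred):
    """Unweighted average of per-class F1 over the gold labels."""
    labels = sorted(set(gold))
    f1s = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Predicting NOT for every tweet in a 3:1 NOT/OFF sample yields F1 of 6/7 on NOT but 0 on OFF, for a macro-F1 of only 3/7 despite 75% accuracy.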
Tables 2 and 3 report results on our development set and on the final test set, respectively. Table 2 lists the performance of our three selected models on each sub-task. The data is split 9:1 into training and held-out sets with stratification; an independent validation set, split off from the training portion, is used for model selection. The table shows one symptom of the imbalanced data: higher accuracy does not guarantee a higher macro-F1 score. Our stopping criterion is therefore the average loss on the validation set, as mentioned above. Based on the validation results, we choose BERT as our model for the final submission.
Table 3 shows the results on the official test set. Note that for sub-task A we also submitted the result of a Bagging classifier with 50 Logistic Regression base estimators, using the same features as the linear model described above. The BERT model achieves 1st place in sub-task A among all participants. BERT-3 denotes BERT trained for 3 epochs; the same notation applies in the other two sub-tables. In sub-tasks B and C, the results are not as good as in sub-task A for two reasons: 1) the class distribution is more skewed than in sub-task A; 2) the number of training instances is much smaller. The worst performance is on sub-task C. The figures, provided by the organizers, offer another view of these results, and we use them to summarize the test distribution in Table 1. In the previous section, we mentioned the discrepancy in class distribution between the training and test sets. For example, in sub-task C, the class 'OTH' constitutes 0.101 of the training data but 0.164 of the test data. This adds difficulty to the task, but the same situation often arises in real-world problems.
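The quoted proportions can be rechecked directly from the sub-task C counts in Table 1:

```python
# Sub-task C counts from Table 1.
TRAIN_C = {"IND": 2407, "GRP": 1074, "OTH": 395}
TEST_C = {"IND": 100, "GRP": 78, "OTH": 35}

def proportion(counts, label):
    """Fraction of the split belonging to the given class."""
    return counts[label] / sum(counts.values())

# 'OTH' is roughly 0.10 of training but 0.16 of test data.
```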

Conclusion
Offensive language and online hostility are pressing problems on social networks. The small proportion of offensive content and the morphological peculiarities of social-media language make high performance difficult to achieve. The diversity and evolution of language across age groups is a further challenge for detection on social media. In conclusion, our work shows competitive results in this shared task through customized preprocessing of the dataset and the power of a pre-trained model. In real life, labeled data is always limited and expensive to obtain through human labor, so transfer learning is a good starting point. Domain adaptation likewise brings prior knowledge of a specific domain to the modeling task at hand. How to tune the parameters is non-trivial, and more efficient approaches remain to be explored that could yield better performance.