YNU-HPCC at SemEval-2019 Task 6: Identifying and Categorising Offensive Language on Twitter

This document describes the submission of team YNU-HPCC to SemEval-2019 for three Sub-tasks of Task 6: Sub-task A, Sub-task B, and Sub-task C. We have submitted four systems to identify and categorise offensive language. The first subsystem is an attention-based 2-layer bidirectional long short-term memory (BiLSTM). The second subsystem is a voting ensemble of four different deep learning architectures. The third subsystem is a stacking ensemble of four different deep learning architectures. Finally, the fourth subsystem is a bidirectional encoder representations from transformers (BERT) model. Among our models, in Sub-task A, our first subsystem performed the best, ranking 16th among 103 teams; in Sub-task B, the second subsystem performed the best, ranking 12th among 75 teams; in Sub-task C, the fourth subsystem performed best, ranking 4th among 65 teams.


Introduction
Identifying offensive language (Zampieri et al., 2019b) on Twitter is particularly challenging because of the informal and creative writing style, which features improper grammar, figurative language, misspellings, and slang. Previous approaches to this kind of task generally relied on hand-crafted features and/or sentiment lexicons fed to classifiers such as Support Vector Machines (SVM). These approaches require a laborious feature-engineering process, which may also need domain-specific knowledge and usually results in both redundant and missing features. In recent years, however, artificial neural networks for feature learning have achieved good results in this field (Christos Baziotis, 2017).
In this document, we present four systems that competed at SemEval-2019 Task 6 (Zampieri et al., 2019b). The first model is a 2-layer BiLSTM equipped with an attention mechanism. The second is a voting scheme that combines a 2-layer BiLSTM, a Capsule Network, a 2-layer bidirectional gated recurrent unit (BiGRU), and the first model. The third is a stacking scheme that combines the same four models. For word representation, these three models use GloVe vectors. The fourth model is BERT-BASE (Jacob Devlin, 2018), which was released last year by Google AI Language.
The remainder of this document is organised as follows. Related work is described in Section 2. Section 3 reports our methodology and data. Section 4 reports our results. The conclusions are summarised in Section 5.

Related Work
In recent years, with the rapid development of social media, the use of aggressive and offensive language as well as hate speech has gradually increased. To tackle this problematic behaviour, one of the most common strategies is to train systems capable of recognising them and either deleting them or setting them aside for human moderation.
Aggression can be divided into three categories: overt aggression, covert aggression, and nonaggression (Kumar et al., 2018). Last year, in a shared task, several participants used deep neural networks and traditional machine learning methods for aggression identification. The best performing systems in this competition used deep-learning approaches based on convolutional neural networks (CNN), BiLSTM, and long short-term memory (LSTM).
Offensive language is commonly defined as hurtful, derogatory, or obscene comments made from one person to another. Currently, there is an increasing amount of such language online, and manually monitoring these posts would incur significant costs (Mathur et al., 2018). Therefore, the automatic identification of suspicious posts has emerged as a trend. In recent years, many researchers have studied the use of deep-learning and traditional machine learning methods for this purpose. Their results indicate that, although several deep-learning approaches produce good scores, traditional supervised classifiers can produce similar scores. Word embeddings, character n-grams, and lexicons of offensive words are popular features, but a robust system does not need all three components. Ensemble methods mostly help (Wiegand et al., 2018).
Many previous studies still tend to equate offensive language with hate speech. With this approach, however, many people may be erroneously classified as hate speakers because commonplace offensive language is not differentiated from genuine hate speech (Davidson et al., 2017; Fortuna and Nunes, 2018). In recent years, the recognition of hate speech has mainly focused on deep-learning methods, such as CNNs (Gambäck and Sikdar, 2017) and Convolution-GRU (Zhang et al., 2018).

Data
The datasets contain data from Twitter and were provided by the organisers. For Sub-task A, Sub-task B, and Sub-task C, the available datasets (Zampieri et al., 2019a) comprised all the training and testing data. Because the organisers did not provide a development set, we held out 0.2 of the training data for development. Table 1 shows the data provided by the organisers.
As shown in Table 1, the data of the three Sub-tasks are significantly imbalanced.
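A hold-out split of this kind can be sketched as follows. This is a minimal illustration rather than the procedure actually used: stratifying by label is an assumption (the text only states that 0.2 of the training data was held out), and the helper name `stratified_split` and the toy label counts are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, dev_fraction=0.2, seed=0):
    """Return (train_idx, dev_idx), holding out dev_fraction of each class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, dev_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_dev = round(len(indices) * dev_fraction)
        dev_idx.extend(indices[:n_dev])    # held out for development
        train_idx.extend(indices[n_dev:])  # kept for training
    return sorted(train_idx), sorted(dev_idx)

# Toy, imbalanced label list (made-up counts, not the task data).
labels = ["NOT"] * 80 + ["OFF"] * 20
train_idx, dev_idx = stratified_split(labels)
print(len(train_idx), len(dev_idx))
```

Stratifying keeps the class ratio of the development set close to that of the training set, which may matter when the classes are as imbalanced as in Table 1.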

Preprocessing
Initially, we received the training and testing data that had been preprocessed by the organisers. We then preprocessed these data again before feeding them to a neural network. During preprocessing, we removed strings that conveyed no sentiment and replaced irregular words and abbreviations. We also removed duplicates and Unicode strings. These steps were implemented as follows:
• Removing consecutive duplicates while retaining one item: we found that some characters were repeated consecutively in the text, e.g. "????" → "?".
• Replacing the emojis on Twitter with the corresponding English definition and replacing abbreviations: there were several emojis in the data conveying different emotions. In addition, the abbreviations in the data also affect the corresponding emotional categories, e.g. "don't" → "do not".
• Replacing irregular words: we found that there were many irregular words in the data, e.g. "bro" → "brother".
• Removing some punctuation: preliminary experiments showed better results when we removed some punctuation; however, we detected emotive punctuation marks such as "!" and "?" and retained them.
and Stanford toolkits, we finally decided to use the Stanford toolkit, because of its better performance.
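The replacement and removal steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup tables `ABBREVIATIONS`, `IRREGULAR_WORDS`, and `EMOJI_NAMES` are hypothetical stand-ins (the actual mappings are not given in the text), and the simple regular-expression tokeniser is used here instead of the toolkit mentioned above.

```python
import re

# Hypothetical lookup tables; the mappings actually used are not listed in the text.
ABBREVIATIONS = {"don't": "do not", "can't": "cannot"}
IRREGULAR_WORDS = {"bro": "brother"}
EMOJI_NAMES = {"\U0001F600": "grinning face"}

def preprocess(tweet):
    text = tweet
    # Replace each known emoji with its English description.
    for emoji, name in EMOJI_NAMES.items():
        text = text.replace(emoji, " " + name + " ")
    # Collapse runs of the same punctuation mark: "????" -> "?".
    text = re.sub(r"([^\w\s])\1+", r"\1", text)
    # Keep ASCII words and the emotive marks "!" and "?"; drop everything else.
    tokens = re.findall(r"[A-Za-z0-9']+|[!?]", text)
    cleaned = []
    for token in tokens:
        key = token.lower()
        # Expand abbreviations first, then irregular words; otherwise keep the token.
        cleaned.append(ABBREVIATIONS.get(key, IRREGULAR_WORDS.get(key, token)))
    return " ".join(cleaned)

print(preprocess("Don't do that bro????"))  # -> do not do that brother ?
```

Doing the emoji replacement before tokenisation matters in this sketch, because the tokeniser discards any character that is not an ASCII word character, "!" or "?".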

System
For SemEval-2019 Task 6, we used five basic models:
• BiLSTM: A BiLSTM combines a forward LSTM and a backward LSTM. (LSTM is an artificial recurrent neural network (RNN) architecture; a common LSTM unit comprises a cell, an input gate, an output gate, and a forget gate.) Because a BiLSTM can better represent bidirectional semantic dependencies, it is often used to model contextual information in natural language processing. For the three Sub-tasks, after several trial comparisons and taking training time into account, we selected a 2-layer BiLSTM. The parameters of our model were chosen to maximise development performance: in Sub-task A, the hidden dimension, recurrent dropout, and batch size were set to 120, 0.25, and 128, respectively; in Sub-task B, to 120, 0.25, and 100, respectively; and in Sub-task C, to 140, 0.35, and 64, respectively.
• BiGRU: Similarly, a BiGRU combines a forward GRU and a backward GRU. (GRU, a variant of LSTM, has a simpler structure than LSTM and works well; the GRU model has only two gates, namely the update gate and the reset gate.) For the three Sub-tasks, we used a 2-layer BiGRU. The parameters of our model were chosen to maximise development performance: in Sub-tasks A and B, the hidden dimension, recurrent dropout, and batch size were set to 120, 0.25, and 100, respectively; in Sub-task C, to 120, 0.25, and 128, respectively.
• BiLSTM with attention: For this model, an attention layer was added to the 2-layer BiLSTM. In the plain BiLSTM, we used the output vector of the last time step as the feature vector and then performed softmax classification. The attention layer instead first calculates a weight for each time step, then takes the weighted sum of the output vectors of all time steps as the feature vector, and finally performs softmax classification. As with the previous models, the parameters were set as follows: in Sub-tasks A and B, the hidden dimension, recurrent dropout, and batch size were set to 120, 0.25, and 256, respectively; in Sub-task C, to 180, 0.3, and 128, respectively.
• Capsule Network: In deep-learning models, spatial patterns are summarised at the lower layers, which helps represent the concepts of the higher layers. For example, when a CNN models spatial information, it needs to replicate feature detectors, which reduces the efficiency of the model. Spatially insensitive methods, on the other hand, are inevitably limited by rich text structures (such as word position information, semantic information, and grammatical structure), which are difficult to encode effectively, and they therefore lack text expression ability. Hinton et al. (Sara Sabour, 2017) proposed the Capsule Network, which replaces a single neuron node of a traditional neural network with a neuron vector and trains this new network through dynamic routing, effectively mitigating the shortcomings of the above two methods. The parameters of our model were as follows: in Sub-tasks A and B, the hidden dimension, batch size, and routing were set to 64, 120, and 15, respectively; in Sub-task C, to 64, 140, and 15, respectively.
• BERT: The BERT model is a language model proposed by Google based on a bidirectional transformer. It is quite different from ELMo (Peters et al., 2018).
For the four models of BiLSTM, BiGRU, BiLSTM with attention, and Capsule Network, the processed Twitter text was first converted into a word vector matrix. The word vector matrix was then processed by the embedding layer, which converted it into a computable vector matrix. Finally, the four models could use this vector matrix for training and prediction.
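The difference between the last-time-step feature of the plain BiLSTM and the weighted-sum feature of the BiLSTM with attention, as described above, can be sketched in plain Python. This is a minimal illustration, not the submitted implementation: scoring each time step by a dot product with a learned `query` vector is an assumption (the text does not specify how the weights are computed), and the hidden states below are made-up numbers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def last_step_feature(hidden_states):
    """Plain BiLSTM: use the output vector of the last time step."""
    return hidden_states[-1]

def attention_feature(hidden_states, query):
    """Attention: weight every time step, then take the weighted sum.

    The dot-product scoring against `query` is an assumed scoring function.
    """
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    feature = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, feature

# Three time steps with 2-dimensional outputs (toy values).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, feature = attention_feature(hidden_states, query=[0.0, 0.0])
print(weights, feature)
```

With an all-zero query every time step receives the same weight, so the attention feature reduces to the mean of the hidden states; a trained query would instead emphasise the time steps that matter most for the classification.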

K-Fold Cross-Validation
We know from Section 3.1 that data imbalance exists in the public datasets published by the organisers. This could lead to unstable or inaccurate experimental results. To manage this problem, we used k-fold (k = 5) cross-validation: the training sample was randomly partitioned into 5 equal-sized subsamples. In each of the 5 rounds, a single subsample was retained as validation data to test the model, and the remaining 4 subsamples were used as training data, so that each subsample served as validation data exactly once.
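The partitioning scheme can be sketched as follows. This is a minimal illustration of generic k-fold cross-validation, not the code used for the submission; the function name `k_fold_indices`, the seed, and the toy sample count are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k subsamples of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# Toy example with 100 samples: each fold validates on 20 and trains on 80.
for train, validation in k_fold_indices(100, k=5):
    print(len(train), len(validation))
```

Every sample appears in exactly one validation subsample, so averaging the development scores over the five folds uses all of the training data for validation once.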

Task A
Sub-task A includes 13240 training instances and 860 testing instances, labelled OFF or NOT. We used four models for predictions on the testing set: BERT (system ID: 528280), voting (system ID: 528117), stacking (system ID: 528015), and BiLSTM with attention (system ID: 528232). The voting model is a soft voting ensemble of four basic models: BiLSTM, BiGRU, BiLSTM with attention, and Capsule Network. The stacking model is a stacking ensemble of the same four basic models. Our team's results, as provided by the task organisers, are shown in Table 2. Among the four models submitted by our team, the BiLSTM with attention model performed the best: its F1 (macro) was 0.7877 and its accuracy was 0.843, ranking 16th among all participants. In addition, from the confusion matrix in Figure 1, it is observed that when the classifier predicts the two classes of labels, namely NOT and OFF, it is more specific to the NOT label, and the precision for the NOT label is higher than that for the OFF label.
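Soft voting, as used in the voting model, combines the class probabilities of the base models rather than their predicted labels. The sketch below is a minimal illustration: unweighted averaging is an assumption (the text does not state whether the models were weighted), and the probabilities are made-up values for four hypothetical base models on a single tweet.

```python
def soft_vote(model_probs, labels):
    """Average per-class probabilities over models and pick the best class.

    model_probs: one probability list per model, ordered like `labels`.
    """
    n_models = len(model_probs)
    averaged = [
        sum(probs[c] for probs in model_probs) / n_models
        for c in range(len(labels))
    ]
    best = max(range(len(labels)), key=lambda c: averaged[c])
    return labels[best], averaged

labels = ["NOT", "OFF"]
# Made-up outputs of four base models for one tweet: [P(NOT), P(OFF)].
model_probs = [[0.9, 0.1], [0.45, 0.55], [0.45, 0.55], [0.4, 0.6]]
label, averaged = soft_vote(model_probs, labels)
print(label, averaged)
```

In this toy case three of the four models individually prefer OFF, so a majority (hard) vote would return OFF, whereas soft voting returns NOT because one model is far more confident than the others.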

Task B
Sub-task B continues from the OFF label of Sub-task A. It includes 4400 training instances and 240 testing instances, labelled TIN or UNT. We used BERT (system ID: 533313), voting (system ID: 533291), and BiLSTM with attention (system ID: 533311) for predictions on the testing set. Our team's results, as provided by the task organisers, are shown in Table 3. Among the three models submitted by our team, the voting model performed the best; its F1 (macro) was 0.6811, its accuracy was 0.8625, and it ranked 12th among all participants. Similar to the previous Sub-task, the confusion matrix in Figure 2 indicates that, of the TIN and UNT labels, the classifier is more sensitive to the TIN label. In terms of precision, the value for the TIN label is also higher than that for the UNT label.

Task C
Sub-task C continues from the TIN label of Sub-task B. It includes 3876 training instances and 213 testing instances, labelled IND, OTH, or GRP. We used BERT (system ID: 536705) and voting (system ID: 537472) for predictions on the testing set. Our team's results, as provided by the task organisers, are shown in Table 4. Among the two models submitted by our team, the BERT model performed the best; its F1 (macro) was 0.6212, its accuracy was 0.7089, and it ranked 4th among all participants. Additionally, as shown in Figure 3, among the IND, OTH, and GRP labels, the highest recall and precision are for the IND label, and the lowest are for the OTH label.
For the three Sub-tasks, misclassifications of the classifier are likely due to data imbalance.
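The gap between accuracy and F1 (macro) reported in the three Sub-tasks is what one would expect under class imbalance, because macro-averaging gives every class the same weight regardless of its size. The sketch below uses the standard definition of macro-averaged F1 on made-up labels; it is an illustration only and not the organisers' official scorer.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in sorted(set(gold) | set(pred)):
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)

def accuracy(gold, pred):
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)

# Made-up imbalanced example: a classifier that always predicts the majority class.
gold = ["NOT"] * 8 + ["OFF"] * 2
pred = ["NOT"] * 10
print(accuracy(gold, pred), macro_f1(gold, pred))
```

Here the accuracy is 0.8, while the macro F1 is only 4/9 (about 0.44), because the minority class contributes an F1 of zero to the average.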

Conclusion
Identifying and categorising offensive language is a task that is drawing increasing attention. In this document, we described the four models that we submitted for Task 6 of the SemEval-2019 Workshop, which involved identifying and categorising offensive language on Twitter. These four models comprise not only traditional neural network models but also a popular pre-trained language model. Our models performed competitively in the experiments; nevertheless, in all three Sub-tasks there remains significant room for improvement compared with the top-ranked participating systems. Therefore, in future work, we will focus on using more word embedding methods and on managing data imbalance issues.