LTL-UDE at SemEval-2019 Task 6: BERT and Two-Vote Classification for Categorizing Offensiveness

We present results for Subtasks A and C of SemEval 2019 Shared Task 6. In Subtask A, we experiment with an embedding representation of postings and use BERT to categorize postings. Our best result ranks 10th (out of 103). In Subtask C, we apply a two-vote classification approach with minority fallback, which ranks 19th (out of 65).


Introduction
The Internet is frequently used for online debates and discussions, in which individuals or groups are increasingly often verbally attacked. Online platform providers aim to remove such attacking posts or, ideally, to prevent them from being published. Manual verification of each posting by a human moderator is infeasible due to the large number of postings created every day. Consequently, automated detection of such attacking postings is the only feasible way to counter this kind of hostility.
In this work, we present our results for the SemEval 2019 Shared Task 6: Identifying and Categorizing Offensive Language in Social Media (Zampieri et al., 2019b) on the OLID dataset (Zampieri et al., 2019a). Subtask A focuses on the binary distinction of whether a post is offensive or not, while Subtask C determines whether the target is an individual, a group, or another entity. Our submission ranks 10th for Subtask A and 19th for Subtask C.
For Subtask A, we experiment with word list-based classification, with classifiers such as SVM or logistic regression based on sentence embeddings, and with neural network-based models such as a Multi-layer Perceptron (MLP) and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018). We find that the SVM performs best on our development set, but BERT reaches the best result on the test dataset. Moreover, a learning curve experiment suggests that more training data will lead only to minor improvements. In Subtask C, we choose a two-vote classification approach, in which we let two systems compete, with a fallback to the minority class in case the systems disagree. This fallback approach proves robust between our development set and the official test dataset.

Related Work
Detection of offensive or potentially hurtful online postings is investigated under a variety of names. Waseem et al. (2017) focus on abusive language, Kumar et al. (2018) tackle the problem as aggression, while Macbeth et al. (2013) approach it as cyberbullying, to mention just a few. Furthermore, the strongly related field of hate speech detection aims at detecting a similar kind of online statement (Waseem and Hovy, 2016; Wojatzki et al., 2018).
Common approaches to detecting such socially unacceptable statements utilize rich feature sets consisting of word n-grams, surface forms, and syntactic features (Warner and Hirschberg, 2012; Nobata et al., 2016). Human knowledge is provided by word lists containing offenses as key words or phrases (Bassignana et al., 2018; Wiegand et al., 2018b). Xiang et al. (2012) approach the task as a topic modelling problem using Latent Dirichlet Allocation (Blei et al., 2003).
These tasks are tackled with feature engineering-based approaches such as SVM or regression models but also with convolutional neural networks (Wiegand et al., 2018b).
For Subtask A, we experiment with the following approaches:
Preprocessing We lowercase all postings and use the Ark Tokenizer (Gimpel et al., 2011) for word splitting. These preprocessing steps are used in all experiments.
Lexical Matching We use the following handcrafted word lists of abusive words: (i) the Profane Word List, containing more than 1,300 English tokens, (ii) the UdS Lexicon of Abusive Words, with 1,651 entries (Wiegand et al., 2018a), and (iii) the Multilingual Lexicon of Words to Hurt, HurtLex (Bassignana et al., 2018), with 9,313 terms. A posting is classified as offensive if it contains any word from the aforementioned lists.
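The matching rule can be sketched as follows. This is a minimal illustration under stated assumptions: the tiny `ABUSIVE_WORDS` set is a hypothetical stand-in for the union of the three lexicons, and whitespace splitting with punctuation stripping is a simplification of the Ark Tokenizer.

```python
import string

# Hypothetical placeholder lexicon; stands in for the union of the three word lists.
ABUSIVE_WORDS = {"idiot", "moron", "scum"}


def is_offensive(posting, lexicon=ABUSIVE_WORDS):
    """Return True if any token of the lowercased posting is in the lexicon."""
    # Simplified tokenization: split on whitespace, strip surrounding punctuation.
    tokens = [t.strip(string.punctuation) for t in posting.lower().split()]
    return any(token in lexicon for token in tokens)
```

A single hit is enough to label the posting as offensive, so a rule of this kind depends entirely on the coverage of the word lists.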
Posting Embeddings We represent each posting by a dense embedding, which we create by summing the word embeddings of its tokens. The resulting posting vector is re-scaled into the range zero to one. We use the pre-trained embeddings provided by Mikolov et al. (2018), which are trained on the Common Crawl corpus.
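A minimal sketch of this representation is given below. The text does not specify how the re-scaling is done, so the per-posting min-max scaling here is an assumption, as is skipping out-of-vocabulary tokens; `word_vectors` is a hypothetical dictionary from token to vector.

```python
def posting_embedding(tokens, word_vectors, dim):
    """Sum the word vectors of a posting and min-max rescale the result to [0, 1]."""
    summed = [0.0] * dim
    for token in tokens:
        vector = word_vectors.get(token)
        if vector is None:  # assumption: out-of-vocabulary tokens are skipped
            continue
        for i, value in enumerate(vector):
            summed[i] += value
    low, high = min(summed), max(summed)
    if high == low:  # constant vector (e.g. no known tokens): avoid division by zero
        return [0.0] * dim
    # Assumption: min-max scaling within the posting vector.
    return [(value - low) / (high - low) for value in summed]
```

For example, with toy vectors `{"you": [0.5, -1.0, 2.0], "fool": [1.5, 0.0, -1.0]}`, the summed vector is `[2.0, -1.0, 1.0]` before re-scaling.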
Multi-Layer-Perceptron (MLP) Using the same preprocessing and feature extraction steps as for the shallow models described above, we train an MLP with 100 hidden units in scikit-learn, with ReLU as the activation function and the Adam optimizer (Kingma and Ba, 2014). We initialize the neural network with the fastText word embeddings provided by Mikolov et al. (2018).
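The configuration described above roughly corresponds to the following scikit-learn setup. This is a sketch only: the toy vectors and labels are made-up placeholders rather than OLID data, and `max_iter` and `random_state` are assumptions that the text does not state.

```python
from sklearn.neural_network import MLPClassifier

# Toy posting vectors in [0, 1] with labels (1 = offensive, 0 = not offensive).
# These are made-up placeholders, not OLID data.
X_train = [[0.9, 0.1, 0.8], [0.8, 0.2, 0.9], [0.1, 0.9, 0.2], [0.2, 0.8, 0.1]]
y_train = [1, 1, 0, 0]

clf = MLPClassifier(
    hidden_layer_sizes=(100,),  # one hidden layer with 100 units
    activation="relu",
    solver="adam",
    max_iter=500,    # assumption: not stated in the text
    random_state=0,  # assumption: fixed seed for reproducibility
)
clf.fit(X_train, y_train)
predictions = clf.predict([[0.85, 0.15, 0.9]])
```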
BERT We use the provided pre-trained BERT-base model (Devlin et al., 2018) to create a vector representation of a posting. We fine-tune the model on the training dataset using a sequence length of 128 and a batch size of 32. We also investigate the impact of enriching the training dataset with additional data obtained by machine translation. We translate the training data into another language and back into English to obtain paraphrases of the original training data, which we expect to improve model performance. We translated the data into Russian, Chinese, and Arabic and back to English using Google's translation service, and repeated the fine-tuning with this enriched dataset.
Ensemble We combine the best three approaches (BERT, SVM, and Logistic Regression) in an ensemble, a strategy that was reported to often yield improvements in a similar shared task for German (Wiegand et al., 2018c). We use the majority vote of these classifiers as the prediction.
Table 1 shows the results for Subtask A. We report results on a self-created development dataset (25% of the original training data: 3,240 postings, of which 1,048 are labeled as offensive and 2,192 as not offensive). We use the majority class as a baseline. On our development set, we find that an SVM with the posting vector representation achieves the best F-Score, followed by BERT. Contrary to our expectation, BERT's performance decreased when the machine-translated data was added. On the test dataset, we find BERT to perform best, followed by the ensemble, which seems to add some robustness to the classification.
Figure 1 shows a learning curve computed over the provided training data, with testing against the held-out development set. We split the training data into equal-sized blocks that are randomly distributed over the labels, and add an increasing number of blocks to observe how performance improves with more data. The flat slope indicates only minor improvements when more data is added, which suggests that improving the machine learning model is a more promising strategy than providing even more data.
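The block-wise construction of the learning curve can be sketched as follows. The function name, the number of blocks, and the handling of leftover examples are assumptions made for illustration; the text does not report these details.

```python
import random


def cumulative_blocks(examples, n_blocks, seed=0):
    """Yield growing training subsets built from equal-sized random blocks."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # random assignment of examples to blocks
    block_size = len(shuffled) // n_blocks  # assumption: leftover examples are dropped
    for k in range(1, n_blocks + 1):
        # The k-th subset consists of the first k blocks.
        yield shuffled[: k * block_size]
```

Each yielded subset would be used to train a classifier that is then evaluated on the fixed development set, giving one point of the curve per subset.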

Subtask C: Offense Targets
The goal in this subtask is to identify the kind of target at which a tweet is directed (i.e., at this point it is already known that the tweet is a targeted offense; only the target itself is not yet determined). A target is either an individual (IND), a group (GRP), or other (OTH) if neither of the first two categories applies. We apply the same approaches as in Subtask A.

Two-Vote Classification with Minority Fallback
An analysis of the class distribution showed that the class other has comparatively few instances. This makes it challenging for a classifier to reliably detect such an under-represented class. Therefore, we redefine the problem as a binary classification problem using two classifiers. If the two classifiers agree in their prediction, we take the predicted class (either individual or group). In case of a disagreement, we select the minority class, other, as the prediction. Accordingly, we alter the training data to contain only two classes: the labels of the under-represented other class are mapped to individual for one classifier and to group for the other, which creates a kind of minority-class noise. Our intuition is that if both classifiers overcome the uncertainty added by this (small) amount of noise, the prediction can be considered reliable. Consequently, we consider a disagreement as evidence for assigning the minority class.
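The label remapping and the fallback rule can be sketched as follows. The function names are hypothetical, and the two underlying binary classifiers are left out; the sketch only shows how training labels are altered and how two predictions are combined.

```python
def remap_labels(labels, other_as):
    """Build binary training labels by mapping OTH to IND or GRP."""
    return [other_as if label == "OTH" else label for label in labels]


def two_vote(pred_a, pred_b):
    """Combine two binary predictions; a disagreement falls back to OTH."""
    return pred_a if pred_a == pred_b else "OTH"


# Classifier A would be trained with OTH mapped to IND, classifier B with OTH mapped to GRP.
gold = ["IND", "OTH", "GRP"]
labels_a = remap_labels(gold, other_as="IND")
labels_b = remap_labels(gold, other_as="GRP")
```

Under this scheme, neither classifier ever predicts other directly; the class is only assigned through disagreement.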
Results Table 2 shows the results. We find that our two-vote classification approach, using two MLPs, reaches the highest F-Score on both the development and the test set. On the development set, we reach the best accuracy with an SVM, but its considerably lower F-Score indicates a strong bias towards a single class. Moreover, MLP+MLP proves robust when comparing the F-Score between the development and the test set.

Conclusion
In this paper, we presented our approach to identifying and categorizing offensive language in social media. We mostly rely on lexical and semantic features for all subtasks. Results show that semantic features have a significant impact on system performance. In general, our system leaves much room for improvement. Detection of offensiveness could probably benefit from more semantically oriented features that go beyond the surface form of words. We make the source code of our experiments publicly available.