Neural Word Decomposition Models for Abusive Language Detection

Text on social media carries many undesirable characteristics, such as hate speech, abusive language and insults. It also differs markedly from traditional text such as news, containing many obfuscated words and intentional typos. This poses robustness challenges for many natural language processing (NLP) techniques developed for traditional text. Recently proposed techniques such as character encoding models, subword models and byte pair encoding for extracting subwords can help with some of these nuances. In our work, we analyze the effectiveness of each of these techniques, and compare and contrast various word decomposition techniques when used in combination with others. We experiment with recent advances in fine-tuning pretrained language models and demonstrate their robustness to domain shift. We also show that our approaches achieve state-of-the-art performance on the Wikipedia attack and toxicity datasets and on a Twitter hatespeech dataset.


Introduction
In recent years, with the growing popularity of social media applications, there has been an exponential increase in the amount of user-generated text, including microblog posts, status updates and comments posted on the web. The power of communicating freely with a large number of users has resulted not only in sharing news and exchanging content but also in a large number of harmful, offensive and aggressive interactions online (Duggan, 2017). Previous work on identifying abusive language has tackled this problem by training computational methods capable of automatically recognizing offensive content in text from MySpace (Yin et al., 2009), Twitter (Waseem and Hovy, 2016; Davidson et al., 2017), Wikipedia comments (Wulczyn et al., 2017) and Facebook posts (Vigna et al., 2017; Kumar et al., 2018).
Most of these models are based on features extracted from words or word n-grams, or on recurrent neural networks that operate on word embeddings (Pavlopoulos et al., 2017; Badjatiya et al., 2017), with a few exceptions that utilize character n-grams to model noisy text and out-of-vocabulary words (Waseem and Hovy, 2016; Nobata et al., 2016; Wulczyn et al., 2017). However, these models are not very effective at modeling obfuscated words such as w0m3n and nlgg3r, which are prominent in user-generated text and are intended to evade hate speech detection (Mishra et al., 2018). In this work, we aim to address this by investigating word, subword and character-based models for abusive language detection.
Recent advances in unsupervised pre-training of language models have led to strong improvements on various general natural language processing and understanding tasks such as question answering, sentiment analysis and natural language inference (Peters et al., 2018; Devlin et al., 2018). However, it is unclear how such models, trained on standard text, transfer information when fine-tuned on noisy user-generated text. In addition to studying the performance of word, subword and character-based models on abusive language detection, we also combine these with pre-trained embeddings, fine-tune pre-trained language models, and examine their efficiency and robustness in identifying abusive text.
Specifically, in this work, we study the effectiveness of character-based models, subword or Byte Pair Encoding (BPE) based models and word-feature-based models, along with pre-trained word embeddings and fine-tuned pretrained language models, for detecting abusive language in text. We make the following contributions:
• We compare the effectiveness of end-to-end character-based models with word + character embedding models, byte pair encoding and subword models, to show which of the techniques perform better than pure word-based models.
• We demonstrate how fine-tuning large pretrained language models, the latest breakthrough in NLP, improves on the state of the art on some of the abusive language datasets, and show that the domain shift is not considerable when these models are applied to abusive language datasets.
• We also examine how preprocessing documents with a byte pair encoding model pretrained on a large corpus substantially boosts the performance of several word-embedding-based models.

Related Work
Identifying abusive content on the web is one of the most widely studied topics on social media text. This problem has been studied for hate speech detection (Kwok and Wang, 2013; Waseem and Hovy, 2016; Waseem, 2016; Ross et al., 2016; Saleem et al., 2017; Warner and Hirschberg, 2012), harassment (Yin et al., 2009; Cheng et al., 2015), cyberbullying (Willard, 2007; Tokunaga, 2010; Schrock and Boyd, 2011), abusive language detection (Sahlgren et al., 2018; Nobata et al., 2016), aggression identification (Kumar et al., 2018; Aroyehun and Gelbukh, 2018; Modha et al., 2018), identifying toxic comments on forums (Wulczyn et al., 2017) and offensive language identification (Wiegand et al., 2018; Zampieri et al., 2019). While most of the work on identifying abusive language on social media has focused on English posts, some of the latest work includes studies on German (Wiegand et al., 2018), Italian (Bosco et al., 2018) and Mexican Spanish (Álvarez-Carmona et al., 2018). Some of the early methods for identifying abusive text used word n-grams, part-of-speech (POS) tags (syntactic features), manually created profanity lexicons or stereotypical words, and TF-IDF features along with sentiment and contextual features, and trained supervised classifiers such as support vector machines (Yin et al., 2009; Warner and Hirschberg, 2012). Waseem (2016) studied character n-grams, skip-grams, Brown clusters and POS-tag-based features for identifying hatespeech. Waseem and Hovy (2016) studied the usefulness of various socio-linguistic features such as gender, location, word-length distribution and Author Historical Salient Terms (AHST) features in identifying hatespeech.
Some of the recent work compared the efficiency of character n-gram based features as inputs to both logistic regression and multi-layer perceptron models (Wulczyn et al., 2017). Nobata et al. (2016) showed that character n-gram features alone can perform well and can efficiently model noisy text. They also showed that off-the-shelf word embeddings can be used to identify abusive text. Pavlopoulos et al. (2017) used deep-learning-based models; specifically, they employed an RNN with a novel classification-specific attention mechanism and achieved state-of-the-art results on identifying attack and toxic content in Wikipedia comments. Badjatiya et al. (2017) investigated three different neural networks for hatespeech detection: (i) a convolutional neural network (inspired by CNNs for sentiment classification by Kim (2014)), (ii) long short-term memory networks (LSTM) to capture long-range dependencies and (iii) the FastText classification model, which represents a document by averaging word vectors that can be fine-tuned for the hatespeech task.
While Badjatiya et al. (2017) analyzed various architectures to encode text for hatespeech detection, we are not aware of any work that has studied various word decomposition models for identifying abusive language in text. Recent work on identifying offensive language in text includes fine-tuning the large pretrained language model BERT, which uses subword units to encode text (Zampieri et al., 2019; Zhu et al., 2019). For the SEMEVAL-2019 task on offensive language identification, 7 of the top 10 submissions used BERT fine-tuning. Zampieri et al. (2019) highlighted that 8% of the 104 systems that participated in the shared task used BERT-based fine-tuning.
In this work, we analyze the effectiveness of different ways of learning representations with character-based models, subword or BPE based models and word-feature-based models. We also combine these with well-known pre-trained word embeddings and with very large pretrained language models that we fine-tune for detecting abusive language in text. In Section 3 we describe the datasets that we study in this work for hatespeech and abusive language detection.

Datasets
We experiment with three datasets: the Twitter dataset (Waseem and Hovy, 2016) and the Personal Attack and Toxicity datasets from the Wikipedia Talk dataset (Wulczyn et al., 2017). Together they cover the sexism/racism, personal attack and toxicity aspects of abuse in user-generated text online.

Twitter Dataset
We use the hatespeech Twitter dataset (Hatespeech) provided by Waseem and Hovy (2016). This dataset was created from a corpus of 136k tweets collected from Twitter by searching for commonly used racist and sexist slurs against various ethnic, gender and religious minorities over a two-month period. The original data had 16,907 tweets with sexist, racist and neither labels (3378, 1970 and 11559 respectively). However, we could only retrieve 11170 of the tweets (2914 sexism, 17 racism and 8239 neither) with Python's Tweepy library. A similar issue of missing tweets has been reported by Mishra et al. (2018). However, the percentage of tweets we lost is much higher than theirs, and most of the lost tweets carry the racism label: only 17 of the original 1970 racism tweets could be retrieved. Since we lost a large chunk of tweets, we conduct our experiments with 5-fold cross validation and report scores on all 5 splits.

Wikipedia talk page
We use the personal attacks (W-ATT) and toxicity (W-TOX) datasets that were randomly sampled from 63 million talk page comments from the public dump of English Wikipedia by Wulczyn et al. (2017). Each comment in both datasets was annotated by at least 10 workers, and we use the majority label as its gold label. Overall, we have 115.8k comments in the W-ATT dataset (69.5k, 23.1k and 23.1k in the train, dev and test splits respectively) and 159.6k comments in the W-TOX dataset (95.6k, 32.1k and 31.8k in the train, dev and test splits). Similar to the hatespeech dataset, both the W-ATT and W-TOX datasets have a skewed distribution of labels, with 13.5% and 15.3% of comments labeled as abusive.
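As a small illustration of the gold-label rule described above, the sketch below derives a majority label from per-worker judgments and measures the share of abusive comments. The function names, the 0/1 encoding of judgments, and the choice to resolve ties as non-abusive are assumptions made for this example; the source does not specify how ties are handled.

```python
def majority_label(votes):
    """Return 1 (abusive) if strictly more than half of the 0/1 votes are 1.

    Assumption: ties are resolved as 0 (non-abusive).
    """
    return 1 if 2 * sum(votes) > len(votes) else 0


def abusive_rate(votes_per_comment):
    """Fraction of comments whose majority label is abusive."""
    labels = [majority_label(votes) for votes in votes_per_comment]
    return sum(labels) / len(labels)


# Toy annotations: three comments, ten hypothetical workers each.
toy_votes = [
    [1] * 7 + [0] * 3,   # clear majority: abusive
    [0] * 10,            # unanimous: not abusive
    [1] * 5 + [0] * 5,   # tie: treated as not abusive under our assumption
]
toy_rate = abusive_rate(toy_votes)
```

With real annotation files, the same rate computed over a whole dataset would show the label skew noted above.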

Methods
In this section, we present various word decomposition methods and modeling architectures we analysed for studying Twitter and Wiki Talk page W-ATT and W-TOX comment datasets.

Word-based Model
As a baseline, we adopt the fastText (Grave et al., 2017) classification algorithm. The fastText algorithm performs mean pooling on top of the word embeddings $w^{emb}_i$ to obtain a document representation. This document representation is passed through a Softmax layer to obtain classification scores. The embeddings can either be learned from scratch or be initialized with pre-trained embeddings and fine-tuned during training. We run multiple variants of fastText in our experiments.
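The forward pass described above (mean pooling followed by a linear layer and Softmax) can be sketched in a few lines of plain Python. This is not the fastText implementation: the embedding table, weights and function names are toy values invented for this example, and dropping out-of-vocabulary tokens is an assumption.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length word vectors into one document vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens, embeddings, weights, biases):
    """Mean-pool word embeddings, then apply a linear layer and softmax."""
    # Assumption: tokens missing from the embedding table are skipped.
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    doc = mean_pool(vectors)
    scores = [sum(w * x for w, x in zip(row, doc)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


# Toy 2-dimensional embeddings and a 2-class linear layer (illustrative only).
toy_embeddings = {"you": [0.0, 0.2], "are": [0.1, 0.1], "awful": [0.9, -0.5]}
toy_weights = [[1.0, -1.0], [-1.0, 1.0]]  # one row of weights per class
toy_biases = [0.0, 0.0]
probs = classify(["you", "are", "awful"], toy_embeddings, toy_weights, toy_biases)
```

In training, both the embedding table and the linear layer would be updated by gradient descent; the sketch covers inference only.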

Subword-based Model
Subwords are formed by concatenating all the characters of a particular length within a word boundary. Adding subwords gives the model the ability to learn that misspelled words such as emnnlp are similar to their correct forms such as emnlp. A pure word-based model would consider emnnlp an out-of-vocabulary (OOV) word if it was not seen in the training set, but a subword model would decompose emnnlp into subwords such as "emn" and "nlp", and train subword embeddings $w^{emb}_{sub}$ for each of them. We use the subword variant of the fastText model to incorporate subword context into the model. The algorithm considers all subwords of varying lengths within the boundary of a word.
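To make the decomposition concrete, the following sketch enumerates all character n-grams of a word within a length range. The function name and the default range are choices made for this example; the `<` and `>` boundary markers follow the convention used by fastText's subword model, and their use here is an assumption about the exact setup.

```python
def char_ngrams(word, min_n=2, max_n=6, boundary=True):
    """List every character n-gram of `word` with min_n <= n <= max_n.

    With boundary=True the word is wrapped in '<' and '>' first, so that
    prefixes and suffixes are distinguishable from word-internal n-grams.
    """
    token = f"<{word}>" if boundary else word
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


# The misspelling shares many subwords with the correct form, so the two
# words end up with overlapping sets of trainable subword embeddings.
shared = set(char_ngrams("emnlp")) & set(char_ngrams("emnnlp"))
```

Here `shared` contains, among others, "emn" and "nlp", which is why a subword model can relate emnnlp to emnlp even when the misspelling never occurs in training.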

Joint Word and Character Embedding Model
Our joint word and character embedding model is adapted from Kim (2014) and Peters et al. (2018). We refer to the model of Kim (2014) as TextCNN going forward. Let $x_i$ be the input word and $c_0^n$ be its character representation, where n is the number of characters in the word. We transform $c_0^n$ by passing it through a character embedding layer, which is an n-gram Character-CNN similar to Peters et al. (2019). The output of the n-gram Character-CNN is concatenated with the word's corresponding pretrained embedding to obtain $w^{emb}_{full}$, as shown in Figure 1(a). That is, character-level features are concatenated with $w^{emb}_i$, the word embedding of word i, to form $w^{emb}_{full}$, the full set of word-level input features.

Figure 1: Architecture for the model described in Section 4.3. Figure 1(a) presents an example of obtaining a word embedding by concatenating character embeddings with the embedding of the word itself. These final embeddings are then fed into the non-static variant of the Kim (2014) architecture, shown in Figure 1(b). The layers of the Kim (2014) model, along with the character CNN layer, are updated during training.

We randomly replace singleton words with a special [UNK] (unknown) token when obtaining $w^{emb}_i$, and also apply dropout (Srivastava et al., 2014) on $w^{emb}_{full}$. For a sentence with l tokens and a convolutional window of size h, the window of input embeddings $w^{emb}_{i:i+h}$ is transformed through a convolution filter $w_c$:

$$c_i = f(w_c \cdot w^{emb}_{i:i+h} + b_c)$$

where $b_c$ is a bias term and f is a non-linear function (ReLU). This produces a feature map c, on top of which we apply global max-over-time pooling.
This process for one feature is repeated to obtain m filters with different window sizes h. The resulting features are concatenated to form the TextCNN document representation. The document representation is passed through a Softmax layer to obtain classification predictions. We also experiment with the original version of TextCNN, which is a pure word-based model without the character embedding component.
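The convolution and max-over-time pooling steps described above can be sketched as follows. This is a didactic plain-Python version rather than the TextCNN code: the toy embeddings, filter values and function names are assumptions, and each window here covers exactly h consecutive tokens.

```python
def relu(x):
    """Rectified linear unit, the non-linearity f in the text."""
    return x if x > 0.0 else 0.0


def conv_feature_map(embeddings, filt, bias, h):
    """Slide one filter over every window of h consecutive token embeddings."""
    feature_map = []
    for i in range(len(embeddings) - h + 1):
        # Flatten the h embedding vectors of the window into one vector.
        window = [x for vec in embeddings[i:i + h] for x in vec]
        feature_map.append(relu(sum(w * x for w, x in zip(filt, window)) + bias))
    return feature_map


def document_representation(embeddings, filters):
    """Apply max-over-time pooling to each filter and concatenate the results.

    `filters` is a list of (weights, bias, window_size) triples.
    """
    return [max(conv_feature_map(embeddings, w, b, h)) for w, b, h in filters]


# Four tokens with toy 2-dimensional embeddings.
toy_sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
toy_filters = [
    ([1.0, 1.0, 1.0, 1.0], 0.0, 2),                # window size h = 2
    ([1.0, -1.0, 1.0, -1.0, 1.0, -1.0], 0.5, 3),   # window size h = 3
]
doc_vector = document_representation(toy_sentence, toy_filters)
```

Each filter contributes a single number regardless of sentence length, which is what lets the concatenated vector feed a fixed-size Softmax layer.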

End-to-end Character Embedding Model
To understand the potential of end-to-end character-based models in dealing with noisy text, we use the Very Deep Convolutional Neural Network (VDCNN) architecture proposed by Conneau et al. (2017), which operates at the character level by stacking multiple convolutional and pooling layers that sequentially extract a hierarchical representation of the text. This representation is fed into a fully connected layer, which is trained to maximize the classification accuracy on the training data.

Pretrained Language Models
Recent literature has shown that transferring knowledge from large pre-trained language models can benefit various NLP tasks, either by adding a task-specific architecture or by fine-tuning the language model for the end task (Peters et al., 2018; Devlin et al., 2018; Peters et al., 2019). In this work, we use the BERT model and fine-tune it on each of our training datasets.

Experiments
In this section, we describe the different variants of the models from Section 4 whose results are presented in Table 1.
fastText: We use multiple variants of the fastText model. Our fastText ngrams=1 model uses embeddings learned for each unigram. We treat this as our baseline model, without any preprocessing of the text. Our fastText ngrams=2 model also uses bigrams along with unigrams as independent tokens to learn embeddings. All bigrams are used, without any frequency threshold. Our fastText ngrams=2 + subword (2-6) model additionally uses all subwords within a word boundary with lengths in the range 2-6. All our fastText models are trained with a learning rate of 0.5 for 5 epochs.
TextCNN (Kim, 2014): We run TextCNN for classification in non-static mode, with a learning rate of 0.0001 and dropout of 0.5, for 50 epochs. We use the default kernel window sizes $N_f$ = (3, 4, 5) with m = 100 filters. We initialize the word embedding layer with the pretrained word2vec embeddings publicly available from Google.
TextCNN + char n-grams: The word embedding layer for this approach is constructed as shown in Figure 1(b). The kernel window sizes h for character tokens are $N_f$ = (1, 2, 3, 4, 5, 6) with m = (32, 32, 32, 64, 64, 64) filters respectively. Increasing the number of filters further to match the parameters in Peters et al. (2018) for character tokens led to overfitting on our datasets, and hence we reduced the number of parameters. All layers are tuned during training. The character embedding CNN layer is initialized randomly with Xavier initialization (Glorot and Bengio, 2010) ... we report in our experiments.

BERT Wordpiece Tokenizer Model with Word models
We use the Wordpiece (BPE) model of BERT (Devlin et al., 2018), pretrained on BooksCorpus and English Wikipedia and produced using 30000 merge operations. BERT uses this model as a precursor step before encoding the text through the transformer. We examine the benefit of the wordpiece text encoding separately from the benefit obtained by fine-tuning the pretrained LM. We hypothesize that the pretrained BPE model splits a word into the most frequent subwords found in the Wikipedia corpus, which can help in mining informative subwords. These informative subwords might prove very beneficial in noisy settings where we observe missing spaces and typos. To test this, we use the pretrained BPE model to encode the document text before inputting it to our word-based models, TextCNN and the fastText word variant. The example below shows one document under each encoding.

Technique: PROCESSED DOC
Original: a complaint about your disruptive behavior here : https : / / en . wikipedia . org / wiki / wikipedia : administrators % 27 noticeboard / incidents # disruptive users vandalizing article about spiro koleka
Custom BPE: complain about your disruptive behavior here : https : / / en . wikipedia . org / wi@@ ki / wikipedia : administrators % 27 noticeboard / incidents # disrup@@ ti@@ ve @@ user@@ s @@ vandali@@ z@@ ing @@ article @@ about @@ spi@@ ro@@ @@ ko@@ le@@ ka
BERT tokenization: complain about your disrupt @@ive behavior here : https : / / en . wikipedia . org / wiki / wikipedia : administrators % 27 notice @@board / incidents # disrupt @@ive users van @@dal @@izing @@ article @@ about @@ sp @@iro @@ ko @@le @@ka

Results and Analysis
Table 1 presents the weighted F1 score, based on the support of each of the classes in the test set, for our classification task. For a classification problem with N samples in the test set and C classes, the weighted F1 score is defined as

$$F1_{weighted} = \sum_{i=1}^{C} \frac{n_i}{N} F1_i$$

where $n_i$ denotes the number of samples in class i and $F1_i$ is the F1 score of class i. We use the sklearn library for computing macro and weighted F1 scores in the paper (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). We report weighted F1 because the Twitter data we obtained had only 17 samples for racism, with a stratified CV split having only 4 samples on average. As the results on this label could be very random and prone to high variance due to the very small number of samples in the train and test sets, we choose to use weighted F1 over macro F1.
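To show why the choice between weighted and macro F1 matters when one class is very rare, the sketch below computes both averages from per-class F1 scores. The paper itself uses sklearn's f1_score; this plain-Python version, its function names and the toy labels are illustrative assumptions, and it takes the set of classes from the true labels only.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score of one class, computed from its TP, FP and FN counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores weighted by class support n_i / N."""
    labels = sorted(set(y_true))
    total = len(y_true)
    return sum(y_true.count(c) / total * per_class_f1(y_true, y_pred, c)
               for c in labels)


# Toy predictions with a single, misclassified example of the rare class.
toy_true = ["neither"] * 6 + ["sexism"] * 3 + ["racism"]
toy_pred = ["neither"] * 5 + ["sexism"] * 3 + ["neither"] * 2
toy_macro = macro_f1(toy_true, toy_pred)
toy_weighted = weighted_f1(toy_true, toy_pred)
```

In this toy case the single missed racism example contributes a per-class F1 of zero, which pulls the macro average down by a full third of its weight while barely affecting the weighted average.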
We also observed very high variance in performance across the CV splits, and hence report the numbers separately for each of them. Table 1 also indicates whether each experiment uses word splitting via BPE, either with the pretrained BERT Wordpiece tokenization model or with a custom BPE model trained on the given dataset. We have highlighted the best performance of each modeling architecture with a *. Table 2 presents the macro F1 score on the W-ATT and W-TOX datasets. The macro F1 score is defined as

$$F1_{macro} = \frac{1}{C} \sum_{i=1}^{C} F1_i$$

where $F1_i$ is the F1 score of class i and C is the number of classes. We picked the best performing models from Table 1 for the macro F1 comparison. We also compare to previous approaches that have achieved the best performance on these datasets. The main conclusions of these experiments are fourfold:
1. Pretrained BPE models transfer well: Pretraining a Wordpiece model on a large general corpus like Wikipedia, and using it to encode input text by splitting words, has shown significant improvements for all the word-based models. The fastText word model with bigrams (row 3 in Table 1) trained with BERT tokenization achieves the best performance on the 1st split of the hatespeech data, and also shows an improvement over the native fastText bigrams model on the W-ATT dataset. The same observation can be made for the TextCNN word model with preprocessing by the pretrained BERT Wordpiece tokenization model (row 11 in Table 1). However, we noticed either a slight degradation or an insignificant improvement when applying BPE encoding with the fastText subword-based model. This is expected, as breaking the informative subwords from BERT into much smaller units might result in many noisy updates.

Predicted Label | Technique | Text
not attack | Original | believe that he was the greatest mother-fucker in the world
attack * | BPE | believe that he was the greatest mother## -## fuck## er in the world
not attack | Original | many thanks for your leaving all edits alone in future with such idiotic diatribes
attack * | BPE | many thanks for your leaving all edit## s alone in future with such idiot## ic## dia## tri## bes

3. End-to-end char models aren't as effective as subword or word + char models: Adding character-based embeddings to aid word-embedding-based models, as well as using subword models, enhances performance over the pure word-based baselines. This supports the hypothesis that modeling at the subword level is beneficial for detecting abusive language. Interestingly, end-to-end character models aren't as effective, which demonstrates that knowledge of words leads to a powerful representation and that word boundary information is still informative in noisy settings.
4. State-of-the-art performance on W-TOX and W-ATT with BERT fine-tuning: Table 3 shows the macro F1 scores of our models in comparison to previous approaches that have achieved the best performance on these datasets. Mishra et al. (2018) reported macro F1 on validation and test data together. From their work it is unclear whether the model was tuned on the validation data and the same data was then used along with the test data to report numbers. Hence, we only use their number as a reference.
We have also observed better numbers with their approach. We have achieved state-of-the-art macro F1 scores on the W-ATT and W-TOX datasets with BERT fine-tuning. We have also added the performance of word-based models on BERT Wordpiece tokenized text for comparison; their numbers run very close to those of BERT.
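The wordpiece preprocessing evaluated in this section can be illustrated with a greedy longest-match-first tokenizer, the matching strategy used by BERT's Wordpiece tokenizer. The sketch below is a simplified stand-in: the toy vocabulary, the function names and the lowercasing step are assumptions, and the `##` continuation prefix follows BERT's vocabulary convention rather than the markers shown in the examples above.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def encode_document(text, vocab):
    """Tokenize whitespace-separated words and flatten the pieces.

    The flattened piece sequence is what a word-based model such as
    fastText or TextCNN would then consume as its input tokens.
    """
    return [piece for word in text.lower().split()
            for piece in wordpiece_tokenize(word, vocab)]


# Hypothetical toy vocabulary; a real one would hold about 30000 entries.
toy_vocab = {"disrupt", "##ive", "users", "van", "##dal", "##izing",
             "idiot", "##ic"}
toy_pieces = encode_document("Disruptive users vandalizing", toy_vocab)
```

Because rare or obfuscated words are broken into frequent pieces, the downstream word-based model sees tokens it has embeddings for instead of a single out-of-vocabulary word.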

Conclusion and Future Work
Existing literature has shown the importance of using finer units such as character or subword units to learn better models and robust representations for identifying abusive language in social media.
In this work, we explore various combinations of such word decomposition techniques and present experiments that bring new insights and/or confirm previous findings. Additionally, we study the effectiveness of large pretrained language models trained on standard text in understanding noisy user-generated text. We further investigate whether subword units ("wordpieces") learned for unsupervised language modeling can improve the performance of bag-of-words based text classification models such as fastText. We evaluate our models on the Twitter hatespeech, Wikipedia toxicity and attack datasets.
Our experiments demonstrate that encoding noisy text via the BERT wordpiece tokenization model before passing it through word-based models (fastText and TextCNN) can boost the performance of word-based models and achieve state-of-the-art performance. Based on our experiments, we conclude that subword models perform competitively with character-based models and occasionally outperform them. We observe that adding character embeddings to the TextCNN model can slightly boost performance compared to word-CNN models.
Our experiments on fine-tuning BERT show improvements on both the Wikipedia toxicity and attack datasets. We observe that BERT can effectively transfer pretrained information to classifying tweets and user comments despite the domain shift from pre-training on BooksCorpus and Wikipedia text. Future work in this direction could include pretraining BERT on a large collection of social media text, which might further enhance the performance of identifying abusive language on social media. Recent work by Wiegand et al. (2019) highlights that most of the datasets used to study abusive language are prone to data sampling bias, and that abusive language identification in realistic scenarios, with a higher percentage of implicit content, is much harder. A potential future direction would be to explore how models pretrained on generic text could incorporate or handle implicit abuse.