Using Transfer-based Language Models to Detect Hateful and Offensive Language Online

Distinguishing hate speech from non-hate offensive language is challenging, as hate speech does not always include offensive slurs and offensive language does not always express hate. Here, four deep learners based on the Bidirectional Encoder Representations from Transformers (BERT), with either general or domain-specific language models, were tested against two datasets containing tweets labelled as either ‘Hateful’, ‘Normal’ or ‘Offensive’. The results indicate that the attention-based models profoundly confuse hate speech with offensive and normal language. However, the pre-trained models outperform state-of-the-art results in terms of accurately predicting the hateful instances.


Introduction
The majority of the tweets on Twitter or posts on Facebook are harmless and often posted purposefully, but some express hatred towards a targeted individual or a minority group and its members. These posts are intended to be derogatory, humiliating or insulting and are defined as hate speech by Davidson et al. (2017). Different from offensive language, hate speech is usually expressed towards group attributes such as religion, ethnic origin, sexual orientation, disability or gender (Founta et al., 2018b). Some of the biggest firms invest heavily in tracking abusive language, e.g., automatic detection of offensive language in comments (Systrom, 2017, 2018) or giving a percentage of how likely a text is to be perceived as toxic. However, these and other existing tools share a common flaw of not distinguishing between offensive and hateful language. One important reason to keep these two separate is that hate speech is considered a felony in many countries. The task of separating offensive and hateful language has proven to be demanding; however, with the recent scientific breakthroughs and the concept of transfer learning, we can take huge steps in the right direction. The paper investigates the effects of transferring knowledge from the Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2019) language model on distinguishing hateful, offensive and normal language, by fine-tuning the pre-trained BERT language model with data containing hateful and offensive language, and comparing its performance to the state-of-the-art on two widely used hate speech detection datasets. Those datasets are presented in Section 2. Section 3 then gives an overview of related work in the field of hate speech detection. Section 4 describes the implemented system architecture. Section 5 presents the experiments, including setup and results, while Section 6 evaluates and discusses those results. Section 7 concludes and suggests future work.

Data
Many existing datasets containing hate speech are publicly available for use and consist of data from several sources online, mainly Twitter (Waseem and Hovy, 2016; Waseem, 2016; Chatzakou et al., 2017; Golbeck et al., 2017; Davidson et al., 2017; Ross et al., 2016; ElSherief et al., 2018; Founta et al., 2018b), while some cover other sources such as Fox News comments (Gao and Huang, 2017) and sentences from posts on the white supremacist online forum Stormfront (de Gibert et al., 2018). Almost all available datasets are labelled by humans, which results in different approaches taken when creating and annotating the datasets. Some researchers use expert annotators (Waseem and Hovy, 2016), others use majority voting among several amateur annotators on platforms such as CrowdFlower (Davidson et al., 2017). However, the task of hate speech detection lacks a shared benchmark dataset (Schmidt and Wiegand, 2017) that can be used to measure the performance of different machine learning models. Further, most annotation schemata follow Waseem and Hovy (2016) by splitting the data into only two basic classes, either hate and non-hate or offensive and non-offensive (classes that then also often are split, e.g., labelling hateful tweets as either sexist or racist). However, it is debatable whether those labels are sufficient to represent hateful and abusive language. In contrast, a few datasets make the distinction between hateful and offensive language, e.g., Davidson et al. (2017) and Founta et al. (2018b), which will be used here and abbreviated D and F, respectively. The dataset by Davidson et al. (2017) consists of 24,783 English tweets and their labels along with some information including the number of annotators. The number of CrowdFlower annotators ranges from three to nine, and majority voting was used when deciding the final class for a tweet: "Hate Speech", "Offensive Language" or "Neither". The label distribution can be seen in Table 1.
The dataset created by Founta et al. (2018b) contains almost 100k annotated tweets with four labels, "Normal", "Spam", "Hateful" and "Abusive". As the authors only provide tweet IDs for researchers to retrieve tweets through the Twitter Application Programming Interface (API), some tweets may for several reasons not be retrievable, e.g., a tweet or the user account behind a tweet may have been deleted; thus, of the 99,799 provided tweet IDs, only 68,299 tweets were retrieved. The label distribution of those compared to the original label distribution for dataset F is shown in Table 2.

Related Work
Nobata et al. (2016) mention some challenges within hate speech detection, e.g., that abusive language evolves over time, with new slurs and clever ways to avoid being detected. Hence they performed a longitudinal study over one year to see how trained models react over time, employing n-grams, word embeddings, and other linguistic and syntactic features. All features combined yielded the best performing model; however, looking at individual features, character n-grams performed best, a result that Waseem and Hovy (2016) also reported.
Transferring knowledge from word embeddings to be used as input to neural networks has been a common technique. Gambäck and Sikdar (2017) experimented with character n-grams in combination with word embeddings from word2vec (Mikolov et al., 2013) in various Convolutional Neural Network (CNN) setups, with the best performing model using transferred knowledge from word2vec. Adding character n-grams boosted precision, but lowered recall. Badjatiya et al. (2017) experimented with several machine learners and neural networks, with the best performer being a Long Short-Term Memory (LSTM) with random word vectors where the network's output was used as input to a Gradient Boosted Decision Tree. However, their results have been shown to be questionable and difficult to reproduce (Mishra et al., 2018; Fortuna et al., 2019). Pavlopoulos et al. (2017a,b) tested word embeddings from both GloVe and word2vec in a Recurrent Neural Network (RNN), while Pitsilis et al. (2018) utilised an RNN ensemble without word embeddings, instead feeding standard vectorised word uni-grams to multiple LSTM networks and aggregating the classifications, to outperform the previous state-of-the-art.
Park and Fung (2017) created a hybrid system that tried to capture features from two input levels, using two CNNs, one character-based and one word-based. Meyer and Gambäck (2019) proposed an optimised architecture combining components with CNNs and LSTMs into one system. One part of the system used character n-grams as input while the other part used word embeddings. They used the dataset from Waseem and Hovy (2016), obtaining better results than previous solutions. Most of the research discussed above used that dataset (with labels 'Sexist', 'Racist' or 'Neither') or a slightly modified version (Waseem, 2016).
The dataset by Davidson et al. (2017) in contrast separates hateful language from offensive and normal language, making the task harder. Zhang et al. (2018) used this dataset and six others, but merged the offensive class with the normal class. On the 2-class hate vs normal language task, they outperformed the state-of-the-art on 6 out of 7 datasets with a system feeding word embeddings from word2vec into a CNN to produce input vectors for an LSTM network with GRU cells performing the final classification. Founta et al. (2018a) used the same dataset, but kept the offensive samples separate from the normal ones, thus taking on the challenge of separating hateful and offensive language. They ran two networks in parallel, one RNN with text input and one feed-forward network with metadata input, followed by a concatenation layer and a classification layer, performing slightly below the F1-score of 0.900 that Davidson et al. (2017) achieved with a baseline Logistic Regression (LR) model. However, Kshirsagar et al. (2018) surpassed the baseline using pre-trained word embeddings as input to multiple Multilayer Perceptron (MLP) layers, achieving a total F1-score of 0.924. Still, the F-score increase is due to better performance on the 'Normal' and 'Offensive' classes, with the model actually performing worse on the 'Hate' class.
This agrees with Malmasi and Zampieri (2018) who tested several supervised learners and ensemble classifiers on the dataset, reporting a noticeable difficulty of distinguishing hateful language from profanity. Their extensive results analysis showed that tweets with the highest probability of being tagged as hate usually are targeted at a specific social group, so that contextual and semantic document features may be required to improve performance. Gaydhani et al. (2018) in contrast claimed near-perfect performance, misclassifying only 0.035% of true hate speech samples on a combination of datasets from Davidson et al. (2017) and Waseem (2016), using n-grams as features and feeding the TF-IDF values of these into classifiers such as Support Vector Machine, Naïve Bayes and Logistic Regression. However, analysing their training and test data shows that 74% of the test data is either duplicated or also present in the training data, giving a highly biased test set and questionable results.

Architecture
Word embedding techniques based on bag-of-words contexts, such as word2vec (Mikolov et al., 2013), only capture the semantic relations among words (Vashishth et al., 2019), whereas language models are more complex and can capture the meaning of a word in a sentence, i.e., its context. This work focuses on such language models and explores the effect of transferring knowledge from a substantial pre-trained language model to a classifier predicting hateful and offensive expressions.

Preprocessing
Twitter authors often make use of abbreviations and internet slang. Many tweets in addition contain retweeted content, mentions of other users, URLs, hashtags, emojis, etc. As language models can capture context between words and prefer complete sentences, only simple preprocessing was used to clean the data. NLTK's (Bird et al., 2009) TweetTokenizer was used to remove URLs, numbers, mentions and 'RT' retweet marks. Stop words were not removed to keep as much context as possible for the language model to capture.
HuggingFace's BertTokenizer was used for text normalisation and punctuation splitting as well as WordPiece subword-level tokenisation. Words that do not occur in the vocabulary are segmented into subword units, so there are no out-of-vocabulary words.
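The cleaning step can be illustrated with a minimal sketch. It uses plain regular expressions from the standard library instead of NLTK's TweetTokenizer, so the patterns and the function name `clean_tweet` are assumptions made for illustration and not the preprocessing code used in the experiments.

```python
import re

# Hypothetical patterns; the exact rules used in the experiments are not given.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+:?")            # "@user" and "@user:"
RETWEET_RE = re.compile(r"\bRT\b")            # retweet mark
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")


def clean_tweet(text: str) -> str:
    """Remove URLs, mentions, 'RT' marks and numbers; keep everything else."""
    for pattern in (URL_RE, MENTION_RE, RETWEET_RE, NUMBER_RE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(clean_tweet("RT @user: check this out https://t.co/abc 123 times!"))
```

Stop words are deliberately left untouched, in line with the aim of preserving context for the language model.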

BERT Model Architecture
BERT's language models can be pre-trained from scratch using only a plain text corpus or fine-tuned with a domain-specific corpus. Although pre-training is a one-time procedure, it is relatively expensive, requiring a large amount of crawled text and computational power. However, Devlin et al. (2019) released several pre-trained models, two of which were used in the experiments: BERT Base, Uncased (12 encoder layers with 768 hidden units and 12 attention heads; 110M parameters) and BERT Large, Uncased (24-layer, 1024-hidden, 16-heads; 340M parameters), that were trained on the English Wikipedia and BookCorpus (Zhu et al., 2015) for 1M update steps. Both models are lowercased and have pre-trained checkpoints that can either be trained with more data or fine-tuned with task-specific data. Both of these approaches were implemented and tested in the experiments. The models are trained with word sequence length up to 512, but this can be shortened when fine-tuning, to save substantial memory. Each encoder in the stack applies self-attention and then passes the results through a simple feed-forward network, before handing the output over to the next encoder.
Most language models pass each input token through a token embedding layer to achieve a numerical representation. BERT solves this by passing each token through three different embedding layers (token, segment and positional). Each of these three layers converts an input sequence of tokens to a vector representation of size (n, 768), where n is the number of tokens in the input sequence. These three vector representations are summed element-wise to construct a single vector used as input for BERT's encoder stack.
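The element-wise sum of the three embeddings can be sketched in a few lines of plain Python. The table sizes below are toy values chosen for readability (BERT Base uses 768 hidden units), the tables are filled with random numbers instead of learned weights, and the names `make_table` and `embed` are hypothetical.

```python
import random

HIDDEN = 8     # toy size; BERT Base uses 768
VOCAB = 50     # toy vocabulary size
MAX_LEN = 16   # toy maximum sequence length


def make_table(rows, cols, rng):
    """Random lookup table standing in for a learned embedding matrix."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
token_table = make_table(VOCAB, HIDDEN, rng)
segment_table = make_table(2, HIDDEN, rng)
position_table = make_table(MAX_LEN, HIDDEN, rng)


def embed(token_ids, segment_ids):
    """Return an (n, HIDDEN) list of lists: token + segment + position."""
    rows = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        rows.append([
            t + s + p
            for t, s, p in zip(token_table[tok], segment_table[seg], position_table[pos])
        ])
    return rows


vectors = embed([1, 5, 7], [0, 0, 0])
print(len(vectors), len(vectors[0]))  # n rows, HIDDEN columns
```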
The model output is where BERT separates itself from a traditional transformer: Each token position in the input sequence outputs a length 768 hidden vector for BERT Base and 1024 for BERT Large. Each encoder outputs hidden vectors that can be used as contextualised word embeddings that can be fed into an existing model. For the fine-tuning approach, only the hidden vectors from the final encoder in the stack are relevant and only the hidden vector in the first position is used for sentence classification. This vector can be used as input to any classifier. Devlin et al. achieved great results using only a single-layer network, but the final systems used here are slightly modified with an additional linear layer of size 2048 added to increase the complexity of the model. (An RNN model was also tested, but omitted as learning did not improve.) For fine-tuning, only the number of labels needs to be added as a new parameter, 3 and 4 for the systems used here. All BERT parameters and the final classifier network parameters are fine-tuned jointly to maximise the systems' predictive capabilities. The logits from the last linear layer are passed through a softmax layer to calculate the final label probabilities. Between BERT's pooled output and the first linear layer, and between the first and second linear layers, dropout is utilised to regularise the systems and reduce the risk of overfitting. In addition, cross entropy is used to calculate the classification error of each sample. To update the whole network's weights iteratively based on the training data, HuggingFace's version of the Adam optimiser (Kingma and Ba, 2017) is used with weight decay fix, warmup, and linear learning rate decay. Figure 1 gives an overview of the system architecture implemented for the experiments.
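Two pieces of this setup lend themselves to a short sketch: the softmax with cross-entropy loss applied to the final logits, and a learning rate that warms up and then decays linearly. The code below is an illustration in plain Python and not the HuggingFace implementation; the function names, the example logits and the number of warmup steps are assumptions.

```python
import math


def softmax(logits):
    """Turn logits into label probabilities (shifted for numerical stability)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, gold_index):
    """Classification error of one sample: negative log-probability of the gold label."""
    return -math.log(softmax(logits)[gold_index])


def warmup_linear_decay(step, total_steps, warmup_steps, peak_lr):
    """Learning rate rising linearly to peak_lr, then falling linearly to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Three labels, as in the 3-class setup; the logits are made up.
logits = [2.0, 0.5, -1.0]
print(softmax(logits), cross_entropy(logits, 0))
# Hypothetical schedule: 100 steps, 10 of them warmup, peak rate 2e-5.
print([warmup_linear_decay(s, 100, 10, 2e-5) for s in (0, 5, 10, 55, 100)])
```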

Further Language Model Training
Starting from BERT's Wikipedia and BookCorpus checkpoint, it is possible to further train the language model with domain-specific corpora. This technique of using unlabelled data from the same domain as the target task to train the language model further using the original pre-training objective(s) was first seen in ULMFiT (Howard and Ruder, 2018). Since the approach taken here only uses two datasets, there are still a lot of datasets from the target domain available. Remember, the pre-training only requires the raw text, and so the labels are irrelevant. All available datasets mentioned at the beginning of Section 2, except the two used for the target task, were collected and used to further train BERT on domain data. Furthermore, BERT's English vocabulary consists of 30,522 segmented subword units learned beforehand. Some vocabulary entries are placeholders that can be replaced with new words. ElSherief et al. (2018) created a list of keywords commonly used as hate speech, and most of those were placed in the unused placeholders when further training BERT from its checkpoints.
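Filling unused vocabulary slots can be sketched as follows. The sketch assumes that placeholder entries follow the `[unusedN]` naming found in the released BERT vocabulary files; the function name `fill_unused_slots` and the toy keywords are hypothetical and stand in for the keyword list of ElSherief et al. (2018).

```python
def fill_unused_slots(vocab, new_words):
    """Replace '[unusedN]' placeholders with new words, keeping the vocabulary size fixed."""
    existing = set(vocab)
    # Skip words already in the vocabulary and drop duplicates, preserving order.
    pending = iter(dict.fromkeys(w for w in new_words if w not in existing))
    updated = list(vocab)
    for index, entry in enumerate(updated):
        if entry.startswith("[unused"):
            word = next(pending, None)
            if word is None:
                break  # no more keywords to place
            updated[index] = word
    return updated


toy_vocab = ["[PAD]", "[unused0]", "[unused1]", "[UNK]", "the", "[unused2]"]
print(fill_unused_slots(toy_vocab, ["keyword_a", "the", "keyword_b"]))
```

Because entries are replaced in place, every other token keeps its index, so the pre-trained embedding matrix does not need to be resized.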
One of BERT's pre-training objectives is next sentence prediction in which the model predicts whether one sentence follows another sentence or not. As a result, the input format for further training BERT is a single file with untokenised text and one sentence per line. The sent_tokenize function from the Natural Language Toolkit (NLTK) was used to split documents into sentences of at least one word. Since tweets rarely consist of multiple complete sentences due to Twitter's 280-character limit, some tweets were split in the middle to construct two sentences instead of discarding them.
Other datasets were formatted more easily, e.g., the Stormfront forum data from de Gibert et al. (2018) came as a large folder where each text file was a sentence. All text data from the datasets were merged into one large text file with nearly 170,000 lines. This file was then used to further train two language models from BERT Base and Large checkpoints on the two original pre-training objectives, masked LM and next sentence prediction. The output of this process, two language models trained on Wikipedia, BookCorpus, and domain data, was used in the experiments to investigate the effect of further training the language model with domain-specific data.
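The splitting of single-sentence tweets described above can be sketched as follows. Two details are assumptions: the split is made at the middle word (the text does not say whether the midpoint is counted in words or characters), and documents are separated by a blank line, as the original BERT pre-training data script expects. The function names are hypothetical.

```python
def to_sentence_pair(sentences):
    """Ensure a document has two 'sentences' by splitting a lone one at its middle word."""
    if len(sentences) != 1:
        return list(sentences)
    words = sentences[0].split()
    if len(words) < 2:
        return list(sentences)  # a single word cannot be split
    middle = len(words) // 2
    return [" ".join(words[:middle]), " ".join(words[middle:])]


def format_corpus(documents):
    """One sentence per line, with a blank line between documents."""
    blocks = ["\n".join(to_sentence_pair(doc)) for doc in documents if doc]
    return "\n\n".join(blocks) + "\n"


docs = [["this tweet has only one sentence"], ["First one.", "Second one."]]
print(format_corpus(docs))
```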

Experiments and Results
The two original pre-trained language models BERT Base and BERT Large from Devlin et al. (2019) were tested together with the two language models (BERT Base* and BERT Large*) further trained with domain-specific data. Each system's performance was tested with the two datasets D (Davidson et al., 2017) and F (Founta et al., 2018b). Dataset F annotates tweets as 'Hateful', 'Offensive', 'Spam' or 'Normal'. When identifying hateful and offensive language, the 'Spam' class is redundant and was omitted. However, to compare to previous research, experiments with the original 4-class dataset F were also carried out.
All text data in the experiments were lowercased. Both datasets were split into a training set containing 80% of the total samples and a held-out test set containing the remaining 20%, with scikit-learn's stratified splitting function used to ensure equal class balance between the sets. The order of the training samples was shuffled before each run. Cross-validation with multiple folds was not implemented due to framework limitations.
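What a stratified 80/20 split does can be shown with a small standard-library sketch. It is a stand-in for scikit-learn's splitting function, not a reproduction of it; the name `stratified_split`, the seed and the toy label counts are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with each class split in the same proportion."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    rng.shuffle(train)  # shuffle the order of the training samples
    return train, test


# Toy label distribution, skewed towards one class.
labels = ["Offensive"] * 80 + ["Normal"] * 15 + ["Hateful"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting per class keeps the rare 'Hateful' class represented in both sets, which a purely random split does not guarantee.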
All experiments were run on devices with at least 64GB RAM, the amount recommended by the creators of BERT. The two original language models were pre-trained with a sequence length of 512 and batch size 256. The fine-tuned models had a sequence length of 128 and batch size 32. All four language models were trained with the Adam optimiser, with the optimal learning rates found to be 3e-5 for the fine-tuning process and 2e-5 for the classification process after an exhaustive search with parameters suggested by Devlin et al. (2019). Other parameters shared by the four systems are a dropout probability of 10% on all layers, 3 training epochs, and an evaluation batch size of 8. The fine-tuning of the language models took around 3 hours on two Nvidia V100 GPUs with 32GB RAM each, while classification with BERT Base and Large took on average around 1 and 2 hours, respectively. System performance is measured by micro-averaged Precision, Recall, and F1-score, as this is more suitable for unbalanced datasets and gives detailed insights into how the models classify each sample. The macro-averaged total for each metric is also presented for comparison.
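The difference between per-class, macro-averaged and micro-averaged scores can be made concrete with a small sketch. The gold and predicted labels below are invented toy data, and the function names are hypothetical; in a single-label multi-class setting the micro-averaged F1-score computed over all classes equals accuracy.

```python
def per_class_prf(gold, pred, labels):
    """Precision, recall and F1 for each label."""
    scores = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


def micro_f1(gold, pred):
    """Micro-averaged F1 over all classes (equals accuracy for single-label data)."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


gold = ["O", "O", "O", "O", "N", "N", "H", "H"]
pred = ["O", "O", "O", "O", "N", "N", "O", "H"]
scores = per_class_prf(gold, pred, ["H", "N", "O"])
print(scores["H"], macro_f1(scores), micro_f1(gold, pred))
```

In the toy data, one missed 'H' sample halves that class's recall and pulls the macro average down, while the micro average moves far less.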

Dataset from Davidson et al. (2017)
Dataset D is quite unbalanced with 77% of the tweets being annotated as 'Offensive' and only 6% being labelled 'Hateful'. As seen in Table 3, all four models perform more or less equally in almost all metrics, and are able to correctly classify tweets as 'Normal' and 'Offensive' fairly well. Out of the four models, BERT Large obtains the best scores for the 'Hateful' class, with precision, recall and F1-score of 0.520, 0.364 and 0.428, respectively. That is, 52% of the examples the model predicted as hateful were correctly classified, but only 36% of the true hateful tweets were classified correctly, yielding low recall. The two models with general language understanding, BERT Base and Large, outperform the two models with domain-specific language understanding on the 'Hateful' class. On this class, BERT Large* obtains an F1-score of 0.331 compared to BERT Large's F1-score of 0.428. This gap in F1-scores is unexpected as the intention of further training the language models with domain-specific data was to increase the hateful language understanding.

Dataset from Founta et al. (2018b)
Dataset F is nearly three times the size of D. The label distribution is also more balanced with roughly half of the samples labelled 'Normal' and the rest distributed between the other three classes. Although only 6% of the tweets are annotated 'Hateful', this is a fair representation of the real world where only a small portion of the online content is hate speech. The best scores for each metric were spread across the four models and there was no clear difference between the models: all obtained an F1-score of 0.67. As with dataset D, the models were able to correctly classify tweets as 'Normal' and 'Offensive' quite well while misclassifying most of the true 'Hateful' and 'Spam' tweets. The best F1-scores for the 'Normal' and 'Offensive' classes were 0.869 and 0.884, respectively, obtained by BERT Base*, but the other models were right behind. The only telling difference between the models was the scores on the 'Hateful' class, with BERT Base the clear winner.
Removing the 'Spam' class from the original dataset, we immediately see an increase in the models' scores for all three classes as shown in Table 4. As expected, the increase is most noticeable for the 'Normal' class which previously was highly confused with the 'Spam' class. The increase is less notable for the 'Hateful' class although BERT Base outperforms the other models by a margin. BERT Base is surprisingly the model that performs best overall, beating the other three models on nearly every metric. Remarkably, 97% of the tweets labelled as 'Normal' are correctly classified by the model, but only 30% of true hateful samples. Again, the models seem to recognise true hate speech as less hateful than the annotators. The two models trained with domain-specific data, BERT Base* and BERT Large*, perform worse on the 'Hateful' class than the other two models. This is an interesting observation as more training with domain-specific data has been shown to increase the performance of models in previous solutions.

Evaluation and Discussion
The main difference between the two datasets used in the experiments is the size and label distribution. The size of dataset F allows for more training samples than dataset D, although systems transferring knowledge from pre-trained language models have shown that even small datasets can achieve similar performance (Howard and Ruder, 2018). The four models' overall performance on datasets D and F is the same despite the fact that the latter dataset allows for more language model fine-tuning. The label distribution in dataset F is more realistic than that of dataset D, where a large portion of the samples is labelled as 'Offensive'. However, this imbalance in dataset D does not seem to affect the models' performance noticeably. The reason is probably that dataset D contains a sufficient number of samples for the models to learn the other two classes. This ability to learn with few training examples is one of the main advantages of using language models instead of traditional word embeddings.

Language Model Selection
Although datasets without the distinction between offensive and hateful language were irrelevant for testing the models in the experiments, they were used as unlabelled data to further pre-train two BERT language models. This additional training is intended to give the language models domain-specific language understanding and has been shown to increase the overall performance in other tasks (Devlin et al., 2019). However, the results obtained from the experiments show that the two models with domain-specific language understanding performed worse than or equal to the language models with general language understanding. As we can see in Table 3, the worst performance of the two extended BERT models was on dataset D. BERT Base* and Large* obtained macro-averaged F1-scores of 0.725 and 0.729, respectively, while the original BERT models obtained F1-scores of 0.751 for BERT Base and 0.759 for BERT Large. The difference between these scores is a result of the models' performance on the 'Hateful' class as the performance on the 'Normal' and 'Offensive' classes is nearly identical for all four models. BERT Large outperforms the other three models on the 'Hateful' class with an F1-score of 0.428. This is in line with Devlin et al. (2019) who found that BERT Large outperformed BERT Base on several other tasks.
However, this is not the case for the results obtained by BERT Large on dataset F. Looking at Table 4, we observe that the smaller model BERT Base outperforms BERT Large on nearly every metric. The most compelling difference can again be seen in the 'Hateful' row, where BERT Base achieved an F1-score of 0.393 compared to BERT Large's F1-score of 0.362, mainly as a result of better recall obtained by BERT Base.
Surprisingly, there is no telling difference when comparing the two models with general language understanding to the two models with domain-specific language understanding. Further training with large domain-specific corpora is expected to be beneficial and increase the performance on downstream tasks like hate speech detection. However, the results from the experiments do not reflect this assumption, and it seems like all four models are able to capture similar features, thus performing equally well. Next sentence prediction is one of BERT's two pre-training objectives. So in order to further pre-train the language model, it is necessary to obtain documents containing at least two sentences. This became a limitation, as the domain-specific data used in the experiments mostly consist of tweets, which often contain only a single sentence, and omitting every single-sentence tweet would lead to a much smaller training corpus. In order to include single-sentence tweets in the training corpus, they were split at the middle. This is not optimal and may be one of the reasons why BERT Base* and Large* did not perform as expected.

Error Analysis
Generally, the results from each dataset indicate that it is hard to separate hateful language from offensive and normal language. This was also the key finding stated by Malmasi and Zampieri (2018) and Davidson et al. (2017) when testing their models' performance on dataset D. For dataset D, most of the annotated hateful samples are confused with the 'Offensive' class, and this may be due to the skewed dataset where the 'Offensive' samples dominate. With dataset F, there is roughly an equal distribution of misclassifications between the 'Offensive' and 'Normal' classes. This indicates that none of the tested models using features from the pre-trained language model is capable of distinguishing hateful language from offensive and neutral language with acceptable accuracy.
To investigate BERT Base's predictions on dataset F more deeply, some correctly and incorrectly classified instances were sampled and analysed. The model tends to predict instances containing clear racist or homophobic slurs as hate speech, while obvious hate speech appears more straightforward for the model to understand and accurately predict. Several instances annotated as 'Hateful', but predicted as 'Normal' or 'Offensive' by the model do not appear to be clear hate speech and are perhaps mislabelled by the human coders and correctly predicted by the model. The text "ISIS message calls Trump 'foolish idiot'" was found four times in the original dataset with different authors, being annotated twice as 'Hateful' and twice as 'Offensive', with the model predicting the human-chosen label on only one of the instances. As stated by Chatzakou et al. (2017), annotation is hard even for humans and this is an example of the gold standard not being perfect even though the Founta et al. dataset was thoroughly constructed.

Comparison to State-of-the-Art
Table 5 shows the results obtained on dataset D by BERT Large compared to previous results. Although the dataset is widely used, some researchers (e.g., Zhang et al., 2018) chose to merge the 'Offensive' and 'Normal' classes into one non-hate class, making their results not comparable to those obtained in the experiments here. All four systems in Table 5 perform equally well with F1-scores around 0.90.
Lee et al. tested several machine learning algorithms on dataset F intending to create a baseline for this dataset. Table 6 shows that the two BERT models and Lee et al.'s word-based RNN-LTC model perform similarly on this dataset. However, BERT Base* achieves an F1-score of 0.361 on the 'Hateful' class, compared to the RNN-LTC model's F1-score of 0.302. This indicates that BERT Base* is better at separating hateful language from the other types of language. RNN-LTC outperformed BERT Base* on the 'Spam' class, resulting in similar total average scores.
The experimental results on dataset F without the 'Spam' class were compared to three baseline systems, since no comparable research was found. The macro-averaged scores are shown in Table 7. Out of the four tested models, BERT Base was the best performing with an F1-score of 0.76. Again, BERT Base's performance on the 'Hateful' class is compellingly better than the best performing Logistic Regression model. BERT Base obtains an F1-score of 0.393 while the LR model achieves an F1-score of 0.310. The improved performance on the 'Hateful' class on both versions of dataset F implies that models transferring knowledge from pre-trained language models are able to distinguish the nuances of abusive language more accurately. Model selection is important when creating a hate speech predictor; however, Gröndahl et al. (2018) argue that model architecture is less important than the type of data and labelling criteria. They found that the tested models, which ranged from simple Logistic Regression to more complex LSTM, performed equally well when recreating several state-of-the-art solutions. Gröndahl et al.'s results are consistent with the investigations conducted during the experiments, where changes in the final classifier's complexity did not lead to any changes in the results.

Conclusion and Future Work
To explore the effects of applying language models to the downstream task of hate speech detection, four systems based on the BERT language models were implemented and tested on two datasets annotated both for hateful and offensive language. Two of the systems were further pre-trained with unlabelled domain-specific data. However, the results did not reflect any notable improvement with the extended language models.
All four models achieved F1-scores close to or above state-of-the-art solutions on both datasets, and their ability to correctly distinguish hate speech from offensive and ordinary language was considerably better than the compared solutions. However, the scores on the 'Hateful' class are not sufficient to bring the systems into practical use, as hateful expressions would pass through the system or more benign cases would be incorrectly censored. Still, language models bring considerable potential for understanding all the nuances of hateful utterances, and further exploration of how to most effectively train and transfer knowledge from them is necessary.
The models used in the experiments were all pre-trained on the English Wikipedia and BookCorpus to obtain general language understanding. Typically, the language that appears in Wikipedia articles and books is somewhat domain neutral and formal. This language may be too different from the hate speech domain in terms of words and sentences. Therefore, it may be beneficial to collect documents from hate speech datasets and create one large corpus, which can be used as input data to pre-train BERT's encoders from scratch.
A problem with BERT is the vast number of parameters that need to be set, leading to memory problems and long training times. However, the usage of transformers for language processing is a fast-moving field, so several ideas and strategies have lately been introduced to improve on the original BERT setup. One of those could be tested on the task, such as ALBERT, 'A Lite BERT' (Lan et al., 2020); GPT-3, 'Generative Pre-trained Transformer' (Brown et al., 2020); continuous pre-training ('ERNIE 2.0'; Sun et al., 2020); transformers for longer sequences ('BigBird'; Zaheer et al., 2020); or layerwise adaptive large batch optimisation ('LAMB'; You et al., 2020). Lan et al. (2020)'s ALBERT can drastically reduce the number of parameters and help solve memory problems and reduce training times. Zaheer et al. (2020)'s 'BigBird', with its sparse attention mechanism, allows for longer input sequences than BERT and is suitable for tasks where the datasets include longer documents. You et al. (2020) utilised large batch stochastic optimisation methods to reduce the training time of BERT remarkably.
As described in Section 4.3, each tweet in the training set was split into two for the next sentence prediction task that BERT performs during pre-training. This was done because tweets rarely contain two full sentences. However, this strategy can lead to some loss of linguistic information and it may be better to just skip next sentence prediction during training and only perform the masked language model task.