Combining BERT with Static Word Embeddings for Categorizing Social Media

Pre-trained neural language models (LMs) have achieved impressive results in various natural language processing tasks, across different languages. Surprisingly, this extends to the social media genre, despite the fact that social media often has very different characteristics from the language that LMs have seen during training. A particularly striking example is the performance of AraBERT, an LM for the Arabic language, which is successful in categorizing social media posts in Arabic dialects, despite only having been trained on Modern Standard Arabic. Our hypothesis in this paper is that the performance of LMs for social media can nonetheless be improved by incorporating static word vectors that have been specifically trained on social media. We show that a simple method for incorporating such word vectors is indeed successful in several Arabic and English benchmarks. Curiously, however, we also find that similar improvements are possible with word vectors that have been trained on traditional text sources (e.g. Wikipedia).


Introduction
Social media has become an important source of information across numerous disciplines (Jaffali et al., 2020). For instance, it allows extracting and analyzing people's opinions, emotions and attitudes towards particular subjects, in a way which is difficult to achieve using other information sources. However, social media posts tend to be short and often contain abbreviations, slang words, misspellings, emoticons and dialect (Baly et al., 2017). For language models (LMs) such as BERT (Devlin et al., 2019), which have been primarily trained on Wikipedia, this poses a number of clear challenges. In the case of Arabic, the challenge is even greater, since social media posts are mostly written in regional dialects, which can differ substantially from the language found in resources such as Wikipedia (Alali et al., 2019). In particular, the Arabic language can be divided into Classical Arabic, Modern Standard Arabic, and Dialectal Arabic (Alotaibi et al., 2019). The latter differs between Arabic countries, and sometimes among regions and cities. Social media is the primary setting in which Arabic dialects appear in written form, owing to the informal nature of these platforms.
As for English, the best results in many Arabic NLP tasks are currently obtained with LMs. In particular, the AraBERT model (Antoun et al., 2020) has achieved state-of-the-art results in sentiment analysis, named entity recognition and question answering, among others. However, AraBERT was trained on Wikipedia and news stories, and has thus not seen the Arabic dialects in which most social media posts are written. Surprisingly, however, Antoun et al. (2020) found that AraBERT is nonetheless able to outperform other methods on social media tasks. This includes methods that use the AraVec embeddings (Soliman et al., 2017), which are word2vec vectors trained on Twitter and have a wide coverage of dialect words.
Our hypothesis is that AraBERT and AraVec have complementary strengths, and that better results can thus be obtained by combining these two resources. Similarly, for English tasks, we would expect that the performance of BERT on social media can be improved by incorporating word embeddings that have been trained on social media. However, for English we would expect to see a smaller effect, since compared to Arabic, the vocabulary of English social media is more similar to the vocabulary in traditional sources. To test these hypotheses, we propose and evaluate a simple classifier which combines language models with static word embeddings. Our main findings are that incorporating word vectors can indeed boost performance. Surprisingly, this even holds for word embeddings that have been trained on standard sources.

Related Work
While there is a large literature on NLP for social media, comparatively few efforts focus on the Arabic language. A notable exception is Heikal et al. (2018), who developed a CNN and LSTM ensemble model for Arabic sentiment analysis, using the pre-trained AraVec word embeddings as input representation. More recently, Kaibi et al. (2020) proposed an approach that relies on the concatenation of pre-trained AraVec and fastText vectors. However, as already mentioned in the introduction, the best results on most datasets are currently achieved by fine-tuning AraBERT (Antoun et al., 2020). For the English language, Nguyen et al. (2020) recently introduced BERTweet, a BERT-based language model that was trained on a large corpus of English tweets. Their experiments show that this model leads to improved results on various tasks involving Twitter posts, such as named entity recognition, part-of-speech tagging and text classification.
In this work, we investigate the effectiveness of combining pre-trained language models with static word embeddings. For earlier language models, most notably ELMo (Peters et al., 2018), it was common practice to combine the contextual embeddings predicted by the language model with static word embeddings. However, the introduction of BERT has essentially eliminated the need for static word vectors in standard settings. On the other hand, several authors have shown that it can be beneficial to incorporate entity vectors with BERT, allowing the model to exploit factual or commonsense knowledge from structured sources (Lin et al., 2019; Poerner et al., 2019).

Proposed Approach
There are various ways in which BERT-based models can be combined with static word vectors. Note, however, that we cannot simply concatenate the contextualised word vectors predicted by BERT with the corresponding static word vectors, because BERT's tokenization strategy means that many words are split into two or more word-piece tokens. One possible solution, adopted by Zhang et al. (2020) in a different setting, is to combine the word-piece tokens from the same word into a single vector, using a convolutional or recurrent neural network. The resulting word-level vector can then be concatenated with the corresponding static word vector. However, without a large training set, there is a risk that the representations predicted by BERT are degraded by this aggregation step. As a simpler solution, we instead combine the representations obtained from BERT and from the static word vectors at the sentence level. In particular, to obtain a sentence vector from the fine-tuned BERT model, we simply take the average of the predicted contextualised vectors. To obtain a sentence vector from the static word embeddings, we use either a Convolutional Neural Network (CNN) or a Long Short-Term Memory (LSTM) network. After concatenating the two types of sentence vectors, we apply dropout, followed by a softmax classification layer. A diagram illustrating the model is shown in Figure 1. Rather than jointly training the combined model, we first fine-tune the BERT model on its own. After this fine-tuning step, we freeze the BERT model and train the CNN (or LSTM) and the classification layer. We found this strategy to be more robust against over-fitting.
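The forward pass of the combined model can be sketched as follows. This is a minimal NumPy illustration of the inference-time computation, not the actual implementation: the helper names (`mean_pool`, `cnn_sentence_vec`, `classify`) are ours, dropout is omitted since it is only active during training, and the dimensions (768-dimensional contextual vectors, 100-dimensional static vectors, 100 filters with kernel size 3) follow the configuration described in the experiments below.

```python
import numpy as np

def mean_pool(contextual_vecs):
    """Sentence vector from BERT: average of the contextualised token vectors."""
    return contextual_vecs.mean(axis=0)                      # shape (768,)

def cnn_sentence_vec(static_embs, filters, kernel_size=3):
    """Sentence vector from static embeddings: 1-D conv + ReLU + global max-pool."""
    n, d = static_embs.shape                                 # n words, d-dim vectors
    outputs = []
    for i in range(n - kernel_size + 1):
        window = static_embs[i:i + kernel_size].reshape(-1)  # (kernel_size * d,)
        outputs.append(np.maximum(filters @ window, 0.0))    # ReLU activation
    return np.max(outputs, axis=0)                           # global max-pooling

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify(contextual_vecs, static_embs, filters, W, b):
    """Concatenate the two sentence vectors and apply a softmax layer."""
    sent = np.concatenate([mean_pool(contextual_vecs),
                           cnn_sentence_vec(static_embs, filters)])
    return softmax(W @ sent + b)

# Toy shapes: 5 word-piece tokens from BERT, 4 words with 100-d static vectors,
# 100 convolutional filters (kernel size 3), 3 output classes.
rng = np.random.default_rng(0)
probs = classify(rng.normal(size=(5, 768)),   # contextualised BERT vectors
                 rng.normal(size=(4, 100)),   # static word vectors
                 rng.normal(size=(100, 300)), # conv filters: 100 x (3 * 100)
                 rng.normal(size=(3, 868)),   # softmax weights: 3 x (768 + 100)
                 np.zeros(3))
```

Note that the two sentence vectors are produced by entirely separate pathways, which is what allows the BERT side to be fine-tuned and frozen before the CNN side is trained.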

Experimental Results
We experimentally analyze the benefit of incorporating static word vectors, in both Arabic and English. For Arabic, we used the following datasets:
L-HSAB A dataset for hate speech and abusive language in the Arabic Levantine dialect (Mulki et al., 2019). It consists of 5846 tweets, each annotated as normal, abusive or hate.
ArsenTD-Lev An Arabic Levantine dataset for sentiment analysis, covering multiple topics (Baly et al., 2019). It contains 4000 tweets, labelled as very negative, negative, neutral, positive, or very positive.
For English, we used the following datasets:
Word Embeddings. For the static word vectors in English, we use GloVe (Pennington et al., 2014): the 100-dimensional word vectors that have been pre-trained on Twitter data (GloVe-twi), as well as the 100-dimensional GloVe vectors that have been trained on Wikipedia (GloVe-wiki).
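Building the input to the CNN/LSTM from such pre-trained vectors amounts to looking each word up in the embedding file. A minimal sketch, under the assumption that the vectors are stored in the standard GloVe text format (one word per line, followed by its vector components); the `load_static_vectors` helper name is ours, and the tiny in-memory file below merely stands in for a real file such as the Twitter GloVe release:

```python
import io
import numpy as np

def load_static_vectors(lines, vocab, dim=100):
    """Build an embedding matrix for `vocab` from GloVe-style text lines
    ("word v1 v2 ... v_dim"); out-of-vocabulary words remain zero vectors."""
    matrix = np.zeros((len(vocab), dim))
    index = {w: i for i, w in enumerate(vocab)}
    for line in lines:
        parts = line.rstrip().split(" ")
        if parts[0] in index:
            matrix[index[parts[0]]] = np.asarray(parts[1:], dtype=float)
    return matrix

# Tiny in-memory stand-in for a real GloVe file:
fake_file = io.StringIO("lol " + " ".join(["0.1"] * 100) + "\n"
                        "omg " + " ".join(["0.2"] * 100) + "\n")
emb = load_static_vectors(fake_file, vocab=["lol", "unseenword"])
```

Leaving out-of-vocabulary words as zero vectors is one simple convention; alternatives such as random initialisation are equally common.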
Language Models. We use the pre-trained AraBERTv0.1 model (Antoun et al., 2020) for Arabic and the BERT base uncased model (Devlin et al., 2019) for English.
Baselines and Methodology. As a baseline, we show the performance of a standard CNN, using only the static word vectors as input. We use 100 convolutional filters, a kernel size of 3, and a ReLU activation function. A global max-pooling layer follows the convolution layer. A dropout layer with a 0.5 drop rate is applied to the max-pooled output to avoid over-fitting. We use SGD with a batch size of 16 for 15 epochs, with early stopping. We also show results for BERT and AraBERT alone, following the BERT TensorFlow implementation for sequence classification provided by Hugging Face (Wolf et al., 2019). Both AraBERT and the English BERT base pre-trained language models share the same architecture, which consists of 12 layers. We use the Adamax optimizer and a batch size of 8. The hyper-parameter search for the fine-tuning process covers the number of epochs (3 to 6) and the learning rates 2e-5 and 5e-5. We choose the best-performing hyper-parameters based on a validation split: we use the standard validation split for datasets where one is provided, and use 20% of the training data as validation otherwise. Once the best hyper-parameters are chosen, we fine-tune BERT on the whole training data.
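The selection procedure is a plain grid search over the two hyper-parameters. The following sketch makes the loop explicit; `fine_tune` and `evaluate` are hypothetical stand-ins for the actual Hugging Face fine-tuning and validation code, and the stub functions at the bottom exist only so the example runs without real training:

```python
from itertools import product

def select_hyperparameters(fine_tune, evaluate, train_data, val_data):
    """Grid search over the grid described above: 3-6 epochs,
    learning rates 2e-5 and 5e-5; returns the best (epochs, lr) pair."""
    best_score, best_config = float("-inf"), None
    for epochs, lr in product(range(3, 7), [2e-5, 5e-5]):
        model = fine_tune(train_data, epochs=epochs, learning_rate=lr)
        score = evaluate(model, val_data)
        if score > best_score:
            best_score, best_config = score, (epochs, lr)
    return best_config  # afterwards, re-fine-tune on the full training data

# Toy illustration with stub functions (no real training happens here):
fine_tune = lambda data, epochs, learning_rate: (epochs, learning_rate)
evaluate = lambda model, val: 1.0 if model == (4, 2e-5) else 0.0
best = select_hyperparameters(fine_tune, evaluate, None, None)
```

With 4 epoch values and 2 learning rates, the grid contains only 8 configurations, so exhaustive search is affordable even though each configuration requires a full fine-tuning run.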
For the CNN variant of our proposed hybrid approach, we use the same configuration as for the CNN baseline, i.e. a convolution layer with 100 filters, a kernel size of 3, and the ReLU activation function, followed by global max-pooling. For the LSTM variant, we use 100-dimensional hidden states. In both variants, the dropout rate is set to 0.5, and we use SGD with a batch size of 16, for 15 epochs, with early stopping.
Results. Table 1 summarizes the performance of the baseline models and the proposed strategy for the Arabic language, while Table 2 shows the results for English. The results for AraBERT and BERT are the best results obtained over three runs. We then fix this model and combine it with the CNN and LSTM models. The results of these combined models (and the CNN baseline) are averaged over 5 runs. We use this approach since the focus is on assessing whether the performance of BERT and AraBERT can be improved.
Overall, the proposed combined model improves the results across almost all datasets, with the CNN and LSTM variants performing broadly similarly. The exceptions are the Arabic ArsenTD-Lev dataset, where the LSTM variant performs substantially better than the CNN variant, and the English Hate dataset, where neither of the two variants outperforms the fine-tuned BERT model. The underperformance on the Hate dataset is likely related to over-fitting, as there is a clear mismatch between training and test data in this dataset (e.g. in terms of annotation strategy and average tweet length). The most surprising finding is that the AraVec-twi and AraVec-wiki word embeddings achieve comparable performance for Arabic, and similarly, the GloVe-twi and GloVe-wiki embeddings achieve comparable performance for English. This suggests that the main improvements are not due to the word embeddings being specialized towards the social media genre, but rather to the fact that they capture complementary facets of word meaning. We conjecture that word vectors can, in particular, provide valuable complementary information for rare words: Schick and Schütze (2020) found that BERT struggles with rare words, and we can indeed expect social media texts to contain a larger proportion of rare words than documents in other genres.

Conclusions
In this paper, we have presented a simple approach to combine static word embeddings with BERT-based language models. Intuitively, the reason why this hybrid approach can outperform the BERT-based models themselves is that the latter were not trained on social media text. An alternative solution would be to train language models on a relevant social media corpus, as in the BERTweet model (Nguyen et al., 2020). While such a strategy is likely to lead to better overall performance, it is not always feasible in practice. For instance, using static word vectors could play an important role in dealing with emerging terms, such as trending hashtags, as continuously updating language models (for many different languages) would be too expensive. Similarly, incorporating static word vectors seems to be a promising strategy for improving language models for low-resource languages, as specialized language models (e.g. trained on social media) are unlikely to become available for such languages.