BERTweet: A pre-trained language model for English Tweets

We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet is trained using the RoBERTa pre-training procedure (Liu et al., 2019), with the same model configuration as BERT-base (Devlin et al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al., 2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks: Part-of-speech tagging, Named-entity recognition and text classification. We release BERTweet to facilitate future research and downstream applications on Tweet data. Our BERTweet is available at: https://github.com/VinAIResearch/BERTweet


Introduction
The language model BERT (Devlin et al., 2019), i.e. the Bidirectional Encoder Representations from Transformers (Vaswani et al., 2017), and its variants have successfully helped produce new state-of-the-art performance results for various NLP tasks. Their success has largely covered the common English domains such as Wikipedia, news and books. For specific domains such as the biomedical or scientific domain, we could retrain a domain-specific model using the BERTology architecture (Beltagy et al., 2019; Gururangan et al., 2020).
Twitter has been one of the most popular micro-blogging platforms where users can share real-time information related to all kinds of topics and events. The enormous and plentiful Tweet data has been proven to be a reliable and real-time source of information in various important analytic tasks (Ghani et al., 2019). Note that the characteristics of Tweets are generally different from those of traditional written text such as Wikipedia and news articles, due to the typical short length of Tweets and frequent use of informal grammar as well as irregular vocabulary, e.g. abbreviations, typographical errors and hashtags (Eisenstein, 2013; Han et al., 2013). This might make it challenging to apply existing language models, pre-trained on large-scale conventional text corpora with formal grammar and regular vocabulary, to text analytic tasks on Tweet data.
To the best of our knowledge, there is no existing language model pre-trained on a large-scale corpus of English Tweets. To fill this gap, we train the first large-scale language model for English Tweets using an 80GB corpus of 850M English Tweets. Our model uses the BERT base model configuration, trained based on the RoBERTa pre-training procedure (Liu et al., 2019). We evaluate our model and compare it with strong competitors, i.e. RoBERTa base and XLM-R base (Conneau et al., 2020), on three downstream Tweet NLP tasks: Part-of-speech (POS) tagging, Named-entity recognition (NER) and text classification. Experiments show that our model outperforms RoBERTa base and XLM-R base as well as the previous state-of-the-art (SOTA) models on all these tasks. Our contributions are as follows:
• We present the first large-scale pre-trained language model for English Tweets.
• Our model does better than its competitors RoBERTa base and XLM-R base and outperforms previous SOTA models on three downstream Tweet NLP tasks of POS tagging, NER and text classification, thus confirming the effectiveness of the large-scale and domain-specific language model pre-trained for English Tweets.
• We also examine whether a commonly used approach of applying lexical normalization dictionaries on Tweets (Han et al., 2012) would help improve the performance of the pre-trained language models on the downstream tasks.
• We publicly release our model under the name BERTweet, which can be used with fairseq and transformers (Wolf et al., 2019). We hope that BERTweet can serve as a strong baseline for future research and applications of Tweet analytic tasks.

BERTweet
In this section, we outline the architecture, and describe the pre-training data and optimization setup that we use for BERTweet.

Architecture
Our BERTweet uses the same architecture configuration as BERT base, which is trained with a masked language modeling objective (Devlin et al., 2019). The BERTweet pre-training procedure is based on RoBERTa (Liu et al., 2019), which optimizes the BERT pre-training approach for more robust performance. Given the widespread usage of BERT and RoBERTa, we do not detail the architecture here (see Devlin et al. (2019) and Liu et al. (2019) for more details).
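For readers unfamiliar with the masked language modeling objective, the following is a minimal sketch of how training inputs and labels can be built. It is not the fairseq implementation. The 15% selection rate and the 80/10/10 replacement scheme follow the general BERT recipe of Devlin et al. (2019); the names `mask_tokens` and `MASK` and the toy vocabulary are hypothetical and used for illustration only.

```python
import random

MASK = "<mask>"  # placeholder mask symbol (assumed name, for illustration)


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None elsewhere,
    so the loss is only computed where a prediction is required.
    """
    rng = rng or random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK  # 80%: replace with the mask symbol
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


toy_vocab = ["good", "morning", "@USER", "HTTPURL", "lol"]
inputs, labels = mask_tokens(["good", "morning", "@USER"], toy_vocab, rng=random.Random(0))
```

Calling `mask_tokens` afresh every time a sequence is seen gives a new mask pattern per epoch, which corresponds to the dynamic masking idea used in RoBERTa.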

Pre-training data
We use an 80GB pre-training dataset of uncompressed texts, containing 850M Tweets (16B word tokens). Here, each Tweet consists of at least 10 and at most 64 word tokens. In particular, this dataset is a concatenation of two corpora:
• We first download the general Twitter Stream grabbed by the Archive Team, containing 4TB of Tweet data streamed from 01/2012 to 08/2019 on Twitter. To identify English Tweets, we employ the language identification component of fastText (Joulin et al., 2017). We tokenize those English Tweets using "TweetTokenizer" from the NLTK toolkit (Bird et al., 2009) and use the emoji package to translate emotion icons into text strings (here, each icon is referred to as a word token). We also normalize the Tweets by converting user mentions and web/url links into special tokens @USER and HTTPURL, respectively. We filter out retweeted Tweets and the ones shorter than 10 or longer than 64 word tokens. This process results in the first corpus of 845M English Tweets.
• We also stream Tweets related to the COVID-19 pandemic, available from 01/2020 to 03/2020. We apply the same data processing step as described above, thus resulting in the second corpus of 5M English Tweets.
We then apply fastBPE (Sennrich et al., 2016) to segment all 850M Tweets with subword units, using a vocabulary of 64K subword types. On average there are 25 subword tokens per Tweet.
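The normalization and filtering steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the released pre-processing code: it splits on whitespace instead of using the NLTK "TweetTokenizer", omits emoji translation and language identification, and treats a leading "RT" token as the sign of a retweet. The function names are hypothetical.

```python
USER_TOKEN = "@USER"
URL_TOKEN = "HTTPURL"


def normalize_tokens(tokens):
    """Map user mentions and web/url links to the special tokens."""
    out = []
    for tok in tokens:
        if tok.startswith("@") and len(tok) > 1:
            out.append(USER_TOKEN)
        elif tok.lower().startswith(("http://", "https://", "www.")):
            out.append(URL_TOKEN)
        else:
            out.append(tok)
    return out


def keep_tweet(tokens, min_len=10, max_len=64):
    """Drop retweets (assumed to start with 'RT') and Tweets outside 10..64 tokens."""
    if tokens and tokens[0] == "RT":
        return False
    return min_len <= len(tokens) <= max_len


def preprocess(tweet):
    """Return the normalized token list, or None if the Tweet is filtered out."""
    tokens = normalize_tokens(tweet.split())  # whitespace split stands in for TweetTokenizer
    return tokens if keep_tweet(tokens) else None


example = "@bob check https://t.co/x this is a long enough tweet to pass the filter ok"
print(preprocess(example))
```

In a full pipeline, the whitespace split would be replaced by a Tweet-aware tokenizer so that punctuation, hashtags and emoticons become separate word tokens before the length filter is applied.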

Optimization
We utilize the RoBERTa implementation in the fairseq library. We set the maximum sequence length to 128, thus generating 850M × 25 / 128 ≈ 166M sequence blocks. Following Liu et al. (2019), we optimize the model using Adam (Kingma and Ba, 2014), and use a batch size of 7K across 8 Nvidia V100 GPUs (32GB each) and a peak learning rate of 0.0004. We train BERTweet for 40 epochs, taking about 4 weeks (and use the first 2 epochs for warming up the learning rate), equivalent to 166M × 40 / 7K ≈ 950K training steps.
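The step count above follows from simple arithmetic, reproduced in the short sketch below. It assumes that "7K" means a batch of exactly 7,000 sequence blocks; the function name and the derived warm-up step count are illustrative and not stated in the text.

```python
def pretraining_budget(num_tweets=850_000_000, avg_subwords=25, max_seq_len=128,
                       batch_size=7_000, epochs=40, warmup_epochs=2):
    """Estimate sequence blocks and optimizer steps from the corpus statistics."""
    blocks = num_tweets * avg_subwords / max_seq_len  # total subwords / block length
    steps_per_epoch = blocks / batch_size
    return {
        "blocks": blocks,
        "total_steps": steps_per_epoch * epochs,
        "warmup_steps": steps_per_epoch * warmup_epochs,
    }


budget = pretraining_budget()
print(f"{budget['blocks'] / 1e6:.0f}M blocks, "
      f"{budget['total_steps'] / 1e3:.0f}K steps, "
      f"{budget['warmup_steps'] / 1e3:.0f}K warm-up steps")
```

Under these assumptions the estimate lands slightly below 950K steps, consistent with the rounded figure reported above.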

Experimental setup
We evaluate and compare the performance of BERTweet with strong baselines on three downstream NLP tasks of POS tagging, NER and text classification, using benchmark Tweet datasets.
For Ritter11-T-POS, we use a 70/15/15 training/validation/test pre-split available from Gui et al. (2017). ARK-Twitter contains two files, daily547.conll and oct27.conll, of which oct27.conll is further split into files oct27.traindev and oct27.test. Following Owoputi et al. (2013) and Gui et al. (2017), we employ daily547.conll as a test set. We then use oct27.traindev and oct27.test as training and validation sets, respectively. For the TWEEBANK-V2, WNUT16 and WNUT17 datasets, we use their own available training/validation/test split. The SemEval2017-Task4A and SemEval2018-Task3A datasets are provided with training and test sets only (i.e. there is no standard split for validation), thus we sample 10% of the training set for validation and use the remaining 90% for training. We apply a "soft" normalization strategy to all of the experimental datasets by translating word tokens of user mentions and web/url links into special tokens @USER and HTTPURL, respectively, and converting emotion icon tokens into corresponding strings. We also use a "hard" strategy by further applying lexical normalization dictionaries (Aramaki, 2010; Liu et al., 2012; Han et al., 2012) to normalize word tokens in Tweets.
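As an illustration of the 90/10 split used for the two SemEval datasets, a minimal sketch is given below. The text does not specify how the 10% is sampled, so uniform random sampling with a fixed seed is an assumption here, and `split_train_validation` is a hypothetical helper name.

```python
import random


def split_train_validation(examples, validation_fraction=0.1, seed=0):
    """Randomly hold out a fraction of the training examples for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_val = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


train, validation = split_train_validation(range(1000))
print(len(train), len(validation))
```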

Fine-tuning
Following Devlin et al. (2019), for POS tagging and NER, we append a linear prediction layer on top of the last Transformer layer of BERTweet with regard to the first subword of each word token, while for text classification we append a linear prediction layer on top of the pooled output.
We employ the transformers library (Wolf et al., 2019) to independently fine-tune BERTweet for each task and each dataset for 30 training epochs. We use AdamW (Loshchilov and Hutter, 2019) with a fixed learning rate of 1e-5 and a batch size of 32 (Liu et al., 2019). We compute the task performance after each training epoch on the validation set (here, we apply early stopping when no improvement is observed after 5 consecutive epochs), and select the best model checkpoint to compute the performance score on the test set.
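To make the "first subword of each word token" convention concrete, the sketch below computes, for every word, the position of its first subword in the segmented sequence; the tagging layer would then read the hidden states at exactly those positions. The toy segmenter and the default offset of one leading special token are assumptions for illustration and do not reproduce the fastBPE segmentation.

```python
def first_subword_positions(words, segment, offset=1):
    """Return the index of each word's first subword in the model input.

    `segment` maps a word to its list of subwords; `offset` is the number of
    special tokens placed before the first word (assumed to be 1 here).
    """
    positions = []
    cursor = offset
    for word in words:
        pieces = segment(word)
        if not pieces:
            raise ValueError(f"segmenter returned no subwords for {word!r}")
        positions.append(cursor)
        cursor += len(pieces)
    return positions


def toy_segment(word, size=4):
    """Hypothetical segmenter: cut a word into chunks of at most `size` characters."""
    return [word[i:i + size] for i in range(0, len(word), size)]


print(first_subword_positions(["BERTweet", "rocks", "!"], toy_segment))
```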
We repeat this fine-tuning process 5 times with different random seeds, i.e. 5 runs for each task and each dataset. We report each final test result as an average over the test scores from the 5 runs.
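The checkpoint selection and averaging protocol can be sketched as below. This is an illustrative reading of the procedure rather than the actual training script: it assumes that a higher validation score is better, that an "improvement" means strictly exceeding the best score so far, and the helper names are hypothetical.

```python
from statistics import mean


def select_checkpoint(val_scores, patience=5):
    """Return (best_epoch, last_epoch_run) under early stopping.

    Training stops once `patience` consecutive epochs pass without the
    validation score exceeding the best score seen so far.
    """
    best_epoch, best_score = -1, float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_scores) - 1


def average_over_runs(test_scores):
    """Final reported result: mean test score across runs with different seeds."""
    return mean(test_scores)


# Validation scores per epoch for one hypothetical run.
scores = [0.50, 0.60, 0.70, 0.69, 0.68, 0.67, 0.66, 0.65, 0.90]
print(select_checkpoint(scores))
```

In this toy run the checkpoint from the third epoch is selected, and the later, higher score is never reached because training has already stopped.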

Baselines
Our main competitors are the pre-trained language models RoBERTa base (Liu et al., 2019) and XLM-R base (Conneau et al., 2020), which have the same architecture configuration as our BERTweet. In addition, we also evaluate the pre-trained RoBERTa large and XLM-R large, although it is not a fair comparison due to their significantly larger model configurations.
[Table caption, partially recovered; POS tagging results] ... (Owoputi et al., 2013) on Ritter11 is reported in the TPANN paper (Gui et al., 2017). Note that Ritter11 uses Twitter-specific POS tags for retweeted (RT), user-account, hashtag and url word tokens which can be tagged perfectly using some simple regular expressions. Thus we follow Gui et al. (2017) and Gui et al. (2018) to tag those words appropriately for all models. Results of ARKtagger and BiLSTM-CNN-CRF (Ma and Hovy, 2016) on TB-v2 are reported by [citation missing]. "+a", "+b" and "+c" denote the additional use of extra training data, i.e. models trained on bigger training data. "+a": additional use of the POS annotated data from the English WSJ Penn treebank sections 00-24 (Marcus et al., 1993). "+b": the use of both training and validation sets for learning models. "+c": additional use of the POS annotated data from the UD English-EWT training set (Silveira et al., 2014).
The pre-trained RoBERTa is a strong language model for English, learned from 160GB of texts covering books, Wikipedia, CommonCrawl news, CommonCrawl stories, and web text contents. XLM-R is a cross-lingual variant of RoBERTa, trained on a 2.5TB multilingual corpus which contains 301GB of English CommonCrawl texts.
We fine-tune RoBERTa and XLM-R using the same fine-tuning approach we use for BERTweet.

Main results
Tables 1, 2, 3 and 4 present our obtained scores for BERTweet and baselines regarding both "soft" and "hard" normalization strategies. We find that for each pre-trained language model the "soft" scores are generally higher than the corresponding "hard" scores, i.e. applying lexical normalization dictionaries to normalize word tokens in Tweets generally does not help improve the performance of the pre-trained language models on downstream tasks.
[Table caption, partially recovered; NER results] ... Limsopatham and Collier (2016). "entity" and "surface" denote the scores computed for the standard entity level and the surface level (Derczynski et al., 2017), respectively.
Our BERTweet outperforms its main competitors RoBERTa base and XLM-R base on all experimental datasets (with the one exception that XLM-R base does slightly better than BERTweet on Ritter11-T-POS). RoBERTa large and XLM-R large, which use significantly larger model configurations, obtain better POS tagging and NER scores than BERTweet. However, BERTweet performs better than those large models on the two text classification datasets.
Tables 1, 2, 3 and 4 also compare our obtained scores with the previous highest reported results on the same test sets. Clearly, the pre-trained language models help achieve new SOTA results on all experimental datasets. Specifically, BERTweet improves the previous SOTA in novel and emerging entity recognition by more than 14% absolute on the WNUT17 dataset, and in text classification by 5% and 4% on the SemEval2017-Task4A and SemEval2018-Task3A test sets, respectively. Our results confirm the effectiveness of our large-scale BERTweet for Tweet NLP.

Discussion
Our results comparing the "soft" and "hard" normalization strategies with regard to the pre-trained language models confirm the previous view that lexical normalization on Tweets is a lossy translation task (Owoputi et al., 2013). We find that RoBERTa outperforms XLM-R on the text classification datasets. This finding is similar to what is found in the XLM-R paper (Conneau et al., 2020), where XLM-R obtains lower performance scores than RoBERTa for sequence classification tasks on traditional written English corpora.
We also recall that although RoBERTa and XLM-R use 160 / 80 = 2 times and 301 / 80 ≈ 3.75 times bigger English data than our BERTweet, respectively, BERTweet does better than its competitors RoBERTa base and XLM-R base. This confirms the effectiveness of a large-scale and domain-specific pre-trained language model for English Tweets. In future work, we will release a "large" version of BERTweet, which will likely perform better than RoBERTa large and XLM-R large on all three evaluation tasks.

Conclusion
In this paper, we have presented the first large-scale language model BERTweet pre-trained for English Tweets. We demonstrate the usefulness of our BERTweet by showing that BERTweet outperforms its baselines RoBERTa base and XLM-R base and helps produce better performance than the previous SOTA models for three downstream Tweet NLP tasks of POS tagging, NER, and text classification. By publicly releasing BERTweet, we hope that it can foster future research and applications of Tweet analytic tasks.