Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media

Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.


Introduction
Sequence transfer learning (Ruder, 2019), which pretrains language representations on unlabeled text (source) and then adapts these representations to a supervised task (target), has demonstrated its effectiveness on a range of NLP tasks (Radford et al., 2018; Devlin et al., 2019). Approaches vary in model, pretraining objective, pretraining data and adaptation strategy. We consider a widely used method, BERT (Devlin et al., 2019), which pretrains a transformer-based model using a masked language model objective and then fine-tunes the model on the target task. We investigate the impact of the domain of the pretraining data (i.e., the similarity between the underlying distributions of source and target data) on the effectiveness of pretrained models. We also propose a cost-effective way to select pretraining data.
Recent studies on domain-specific BERT models, which are pretrained on specialty source data, empirically show that, when in-domain data is used for pretraining, target task performance can be improved (Alsentzer et al., 2019; Huang et al., 2019; Beltagy et al., 2019). These publicly available domain-specific BERT models are valuable to the NLP community. However, the selection of in-domain data usually relies on intuition, which varies across NLP practitioners (Dai et al., 2019). According to Halliday and Hasan (1989), the context-specific usage of language is affected by three factors: field (the subject matter being discussed), tenor (the relationship between the participants in the discourse and their purpose) and mode (the communication medium, e.g., 'spoken' or 'written'). Generally, the selection of pretraining data in existing domain-specific BERT models is based on the field rather than the tenor. For example, BioBERT and SciBERT (Beltagy et al., 2019) are both pretrained on scholarly articles, but in different fields (biology and computer science).
We conduct a case study of pretraining BERT on social media text, which has a very different tenor from the data used in existing domain-specific BERT models. Our contributions are two-fold: (1) we release two pretrained BERT models trained on tweets and forum text, and we demonstrate the effectiveness of these two resources on a range of NLP data sets using social media text; and, (2) we investigate the correlation between source-target similarity and task accuracy using different domain-specific BERT models. We find that simple similarity measures can be used to nominate in-domain pretraining data (Figure 1).

Related Work
Selecting data to pretrain BERT. There are two known strategies: (1) collecting very large generic data, such as web crawl and news (Radford et al., 2019; Baevski et al., 2019); and, (2) selecting in-domain data, yielding what we refer to as domain-specific BERT models. Those following the first strategy intend to build universal language representations that are useful across multiple domains. They also believe that pretraining on larger data leads to better pretrained models. For example, Baevski et al. (2019) empirically show that the average GLUE score (Wang et al., 2019) can increase from lower than 80 to higher than 81 when the size of pretraining data increases from 562 million to 18 billion tokens.
Figure 1: Recent studies have demonstrated the effectiveness of domain-specific BERT models. However, the selection of in-domain data usually resorts to intuition, which varies across NLP practitioners, especially regarding intersecting domains. We investigate the correlation of source-target similarity and the effectiveness of pretrained models. In other words, we aim to use simple similarity measures to nominate in-domain pretraining data.
Our study uses the second strategy. However, we select our pretraining data from the tenor perspective rather than the field. A summary of the source data used in these domain-specific BERT models can be found in Table 1.
Finding in-domain data. Our study relates to the literature on investigating domain similarity (Blitzer et al., 2006; Ben-David et al., 2007; Ruder and Plank, 2017) and text similarity (Mihalcea et al., 2006; Pavlick et al., 2015; Kusner et al., 2015). Our work is also inspired by the study by Dai et al. (2019) on the impact of source data on pretrained LSTM-based models (i.e., ELMo) and by Van Asch and Daelemans (2010) on the correlation between similarity and accuracy loss of POS taggers.

Pretraining BERT Models
We follow the practices used in other domain-specific BERT models (Beltagy et al., 2019) to pretrain our BERT models. We use the original vocabulary of BERT-Base as our underlying word piece vocabulary and use the pretrained weights from the original BERT-Base as the initialization weights. Note that all domain-specific models we consider in this study are based on this paradigm, which means these models are supposed to capture both generic knowledge (inherited from the original BERT) and domain-specific knowledge.
For the pretraining objective, we remove the Next Sentence Prediction (NSP) objective. Social media text, especially tweets, is often too short to sample consecutive sentences. In addition, recent studies observe benefits in removing the NSP objective with sequence-pair training.
Twitter. We use English tweets posted from Sep 1 to Oct 30, 2018 4 to pretrain our Twitter BERT. In total, there are 60 million English tweets, consisting of 0.9B tokens. Although we aim to avoid tailored preprocessing strategies, so as to make a fair comparison with other domain-specific BERT models, we find that 44% of these tweets contain URLs and 78% mention other users' names with '@' (when a tweet replies to another tweet, the '@' mention is added automatically). We thus employ minimal processing by: (1) replacing tokens starting with '@', which refer to a Twitter user's account name, with a special token [TwitterUser]; and, (2) replacing URLs with a special token [URL]. We hypothesize that the surface form of these tokens does not contain useful information.
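The two replacement rules above can be sketched in a few lines of Python. This is a minimal illustration, not the exact preprocessing used for Twitter BERT: the regular expressions, the whitespace tokenization, and the function name normalize_tweet are all assumptions made for the example.

```python
import re

# Hypothetical patterns: the exact rules used for Twitter BERT are not specified.
USER_PATTERN = re.compile(r"^@\S+")
URL_PATTERN = re.compile(r"^(https?://|www\.)\S+", re.IGNORECASE)


def normalize_tweet(text: str) -> str:
    """Replace user mentions and URLs with the special tokens described above.

    Assumes simple whitespace tokenization.
    """
    normalized = []
    for token in text.split():
        if USER_PATTERN.match(token):
            normalized.append("[TwitterUser]")
        elif URL_PATTERN.match(token):
            normalized.append("[URL]")
        else:
            normalized.append(token)
    return " ".join(normalized)


print(normalize_tweet("@alice check https://example.com now"))
# prints: [TwitterUser] check [URL] now
```

Tokens that merely contain '@', such as e-mail addresses, are left untouched, since only tokens starting with '@' are treated as account names.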
Forum. We use local business reviews released by Yelp 5 to pretrain our Forum BERT. In total, there are five million reviews, consisting of 0.6B tokens. No preprocessing is conducted on the text.
We used four Nvidia P100 GPUs for the pretraining. Training of each model took seven days.

Effectiveness of Pretrained BERT Models
To evaluate the effectiveness of our pretrained BERT models, we experiment on a range of classification and Named Entity Recognition (NER) data sets. Both text classification and NER are fundamental NLP tasks that can employ generic architectures on top of BERT. For the classification task, the representation of the first token (i.e., [CLS]) is fed into the output layer for the final prediction. For the NER task, the representations of the first sub-token within each token are taken as input to a token-level classifier to predict the token's tag. We did not explore more complex architectures, such as adding LSTM or CRF on top of BERT (Beltagy et al., 2019;Baevski et al., 2019), because our aim is to demonstrate the efficacy of domain-specific BERT models and to observe the impact of pretraining data, rather than to achieve state-of-the-art performance on these data sets.
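As an illustration of the NER input described above, the following sketch selects the position of the first sub-token of each token. It assumes the WordPiece convention that continuation pieces are prefixed with '##'; the function name and the example pieces are hypothetical and are not actual tokenizer output.

```python
def first_subtoken_indices(wordpieces):
    """Return the positions of word pieces that start a new token.

    Assumes continuation pieces start with '##' and that special tokens
    such as [CLS], [SEP] and [PAD] receive no tag.
    """
    special = {"[CLS]", "[SEP]", "[PAD]"}
    return [
        i
        for i, piece in enumerate(wordpieces)
        if piece not in special and not piece.startswith("##")
    ]


# Illustrative word pieces, not produced by a real tokenizer.
pieces = ["[CLS]", "para", "##ce", "##tam", "##ol", "causes", "nausea", "[SEP]"]
print(first_subtoken_indices(pieces))  # prints: [1, 5, 6]
```

Only the representations at these positions would be passed to the token-level classifier, so each original token receives exactly one predicted tag.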
Our BERT results follow the standard two-stage approach of fine-tuning the pretrained model. Domain-specific BERTs add a stage in the middle: fine-tuning BERT on domain-specific unlabeled data (cf. Figure 1).

Target Tasks
We use eight target tasks, whose text is sampled from Twitter and forums, to examine whether our BERT models lead to improvements over the original BERT. These tasks are Airline 6: classifying sentiment on tweets about major U.S. airlines; BTC: identifying location, person, and organization on tweets (Derczynski et al., 2016); SMM4H-18: classifying whether the user reports an adverse drug event (task3) (Weissenbacher et al., 2018), or intends to receive a seasonal influenza vaccine (task4) on tweets about health (Joshi et al., 2018); CADEC: identifying adverse drug events etc. on reviews about medications (Karimi et al., 2015); SemEval-14: identifying product or service attributes on reviews about laptops and restaurants (Pontiki et al., 2014); SST: classifying sentiment on movie reviews (Socher et al., 2013).
In addition, we use four tasks that do not use social media text to investigate how our BERT models perform on out-of-domain target tasks: Paper

Results
We observe that our BERT models achieve the highest F1 score on 6 out of 8 target tasks that use social media text (Table 2). On CADEC (medications) and SemEval-14 laptop, SciBERT achieves the highest score due to the overlapping fields (i.e., medication and computer hardware, respectively). We note, however, that our Forum BERT achieves very close results. This demonstrates the effectiveness of our pretrained models on target tasks using social media text. To our surprise, on target tasks using tweets, Forum BERT achieves better results than Twitter BERT on 3 classification tasks. On one hand, this may be explained by the observation of Baldwin et al. (2013) that forum text is the 'median' data, which is similar to all other types of social media text. On the other hand, it also reveals the challenge of pretraining contextual language representations on short tweets.
We also observe that, when domain-specific models are applied to a target task with out-of-domain data, they achieve much lower results than the original BERT. For example, BioBERT achieves lower results than the original BERT on 7 out of 8 target social media tasks. It only achieves a better result on CADEC, which is about medications. Recall that all these domain-specific BERT models use the pretrained weights of the original BERT as initialization. On one hand, we argue that this observation may challenge the conventional wisdom that the larger the pretraining data is, the better the pretrained model is. Training on out-of-domain source data may have a negative impact, at least for the two-stage pretraining approach we consider. On the other hand, this observation reinforces recent work showing the importance of task-adaptive pretraining (Gururangan et al., 2020).
Table 2: Effectiveness of different BERT models, evaluated on downstream tasks. The number of tokens in each pretraining data set is listed in brackets. C: classification task, for which we report macro-F1; N: NER task, for which we report span-level micro-F1. We repeat all experiments five times with different random seeds and report mean values. Underline: the best result is significantly better than the second-best result (paired Student's t-test, p: 0.05).
[Table residue, layout lost in extraction. Surviving labels: ForumBERT, SciBERT. Surviving values: 159, 36, 43, 161.]
4 Internet archive, accessed 1 June 2020.
5 Yelp Challenge, accessed 1 June 2020.
6 Kaggle Twitter US Airline Sentiment Challenge.
Error analysis on CADEC. We conduct an error analysis on CADEC, because it is at the intersection between social media tenor (online posts) and medication field (adverse drug events), and thus could be similar to multiple sources. We compare the erroneous predictions made by the two best performing BERT models, Forum BERT and SciBERT, as well as by the baseline BERT model. In Table 3, we observe that both domain-specific BERT models greatly reduce the number of false positives made by the baseline BERT. Specifically, 159 false positives made by the baseline BERT are fixed by the domain-specific BERT models. However, domain-specific BERT models do little to reduce the number of false negatives. There are 258 gold mentions recognized by none of the three models, and only 41 false negatives made by the baseline BERT are fixed by the domain-specific BERT models (Table 4).

Analysis
Having empirically shown the importance of selecting in-domain source data, we turn to the next question: can we find a cost-effective way to nominate in-domain source data?

Measuring Similarity
We use three measures of the similarity between source and target data. We then observe whether these similarity values correlate with the usefulness of pretrained models in § 5.2.
Language model perplexity (PPL) has been used to provide a proxy to estimate corpus similarity (Baldwin et al., 2013). We construct Kneser-Ney smoothed 3-gram models (Heafield, 2011) on source data and use the perplexity of target data relative to these language models as the similarity between source and target data.
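To make the perplexity measure concrete, the sketch below trains a 3-gram model on source sentences and scores target sentences with it. To keep the example dependency-free, it substitutes add-one (Laplace) smoothing for the Kneser-Ney smoothing used in this study, so its values are not comparable to ours; the function names and the toy sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train_trigram(source_sentences):
    """Count trigrams and their bigram contexts on tokenized source sentences."""
    trigram_counts, context_counts, vocab = Counter(), Counter(), set()
    for sentence in source_sentences:
        padded = ["<s>", "<s>"] + sentence + ["</s>"]
        vocab.update(padded)
        trigram_counts.update(ngrams(padded, 3))
        # Every bigram except the last one is the context of exactly one trigram.
        context_counts.update(ngrams(padded[:-1], 2))
    # Reserve one extra vocabulary slot for unseen words.
    return trigram_counts, context_counts, len(vocab) + 1


def perplexity(model, target_sentences):
    """Perplexity of target sentences under an add-one smoothed trigram model.

    Add-one smoothing is a simplification; it is NOT Kneser-Ney smoothing.
    Assumes target_sentences is non-empty.
    """
    trigram_counts, context_counts, vocab_size = model
    log_prob, n_predictions = 0.0, 0
    for sentence in target_sentences:
        padded = ["<s>", "<s>"] + sentence + ["</s>"]
        for gram in ngrams(padded, 3):
            prob = (trigram_counts[gram] + 1) / (context_counts[gram[:2]] + vocab_size)
            log_prob += math.log(prob)
            n_predictions += 1
    return math.exp(-log_prob / n_predictions)


# Toy data, invented for illustration only.
source = [["i", "love", "this", "place"], ["i", "love", "that", "place"]]
model = train_trigram(source)
print(perplexity(model, [["i", "love", "this", "place"]]))
print(perplexity(model, [["protein", "binds", "the", "receptor"]]))
```

A lower perplexity indicates that the target text is better predicted by the source language model, that is, that source and target are more similar.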
Jensen-Shannon divergence (JSD), based on term distributions, has been successfully used for domain adaptation (Ruder and Plank, 2017). We first measure the probability of each term (up to 3-gram) in source and target data, separately. Then, we use the Jensen-Shannon divergence between these two probability distributions as the similarity between source and target data.
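The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions: n-grams of all orders (1 to 3) are pooled into a single distribution, the logarithm is taken in base 2 so that the divergence lies in [0, 1], and the function names are hypothetical. Whether n-gram orders were pooled or treated separately is not specified in the text.

```python
import math
from collections import Counter


def term_distribution(tokens, max_n=3):
    """Relative frequency of every n-gram up to max_n (orders pooled; an assumption).

    Assumes a non-empty token list.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence between two term distributions (base-2 logarithm)."""
    divergence = 0.0
    for term in set(p) | set(q):
        p_i, q_i = p.get(term, 0.0), q.get(term, 0.0)
        m_i = 0.5 * (p_i + q_i)  # mixture distribution
        if p_i > 0:
            divergence += 0.5 * p_i * math.log2(p_i / m_i)
        if q_i > 0:
            divergence += 0.5 * q_i * math.log2(q_i / m_i)
    return divergence


# Toy token lists, invented for illustration only.
source = term_distribution("the food was great and the service was great".split())
target = term_distribution("the food was awful".split())
print(jensen_shannon_divergence(source, target))
```

With this convention, identical distributions give a divergence of 0 and distributions with no shared terms give 1, so smaller values indicate more similar source and target data.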
Target vocabulary covered (TVC) measures the percentage of the target vocabulary present in the source data, where only content words (nouns, verbs, adjectives) are counted. Dai et al. (2019) show that it is very informative in predicting the effectiveness of pretrained word vectors.
In addition, Ruder and Plank (2017) show that the diversity of source data is as important as domain similarity for domain adaptation. Inspired by this, we also explore a very simple diversity measure: type-token ratio (TTR, # unique tokens / # tokens), which measures the lexical diversity of the source data.
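Both TVC and TTR reduce to set and count operations, as the following sketch shows. It assumes that the target content words (nouns, verbs, adjectives) have already been extracted by a POS tagger, which is not shown; the function names and toy data are hypothetical.

```python
def target_vocabulary_covered(source_tokens, target_content_words):
    """Percentage of the target content-word vocabulary that occurs in the source.

    Assumes target_content_words is non-empty and was filtered beforehand
    to nouns, verbs and adjectives.
    """
    target_vocab = set(target_content_words)
    source_vocab = set(source_tokens)
    return 100.0 * len(target_vocab & source_vocab) / len(target_vocab)


def type_token_ratio(tokens):
    """Number of unique tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)


# Toy data, invented for illustration only.
source = "the battery life is great".split()
target_content = ["battery", "screen", "great", "slow"]
print(target_vocabulary_covered(source, target_content))  # prints: 50.0
print(type_token_ratio(["a", "b", "a", "c"]))  # prints: 0.75
```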
To mitigate the impact of source data size on these measurements, we sample five sub-corpora from each source data set, each of which contains 10M tokens. We then report the similarity between source and target data, and the diversity of source data, as the average over these sub-corpora.
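The sampling procedure can be sketched as below. The text does not specify how sub-corpora are drawn, so sampling whole sentences uniformly at random until the token budget is reached is an assumption, as are the function names and the fixed seed.

```python
import random
from statistics import mean


def sample_subcorpus(sentences, token_budget, rng):
    """Draw random sentences until the sample holds at least token_budget tokens."""
    order = list(range(len(sentences)))
    rng.shuffle(order)
    sample, n_tokens = [], 0
    for index in order:
        if n_tokens >= token_budget:
            break
        sample.append(sentences[index])
        n_tokens += len(sentences[index])
    return sample


def averaged_measure(sentences, measure, token_budget=10_000_000, n_samples=5, seed=0):
    """Average a corpus-level measure over several sampled sub-corpora."""
    rng = random.Random(seed)
    return mean(
        measure(sample_subcorpus(sentences, token_budget, rng))
        for _ in range(n_samples)
    )
```

Here measure is any function that maps a list of tokenized sentences to a number, for example a similarity to a fixed target data set or a diversity score of the sample itself.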

Correlation Analysis
To analyze how the effectiveness of domain-specific BERT models correlates with the similarity between source and target data, we use Pearson correlation analysis to relate the improvements due to domain-specific BERT models to the similarity between source and target data. For example, on the BTC task, we take the performance of the original BERT as the baseline and measure the improvement due to Twitter BERT as 1.0, whereas the corresponding value for BioBERT is −2.9. Note that we repeat all the experiments five times; therefore, we collect 300 source-target data points in total.
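For reference, the Pearson correlation coefficient over such source-target data points can be computed as in the sketch below. The function name is hypothetical and the numbers are toy values chosen for illustration; they are not results from this study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences.

    Assumes at least two points and non-constant inputs.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Toy values: one similarity score and one improvement per source-target pair.
similarity = [0.2, 0.4, 0.5, 0.7, 0.9]
improvement = [-3.0, -1.0, 0.5, 0.2, 1.5]
print(pearson(similarity, improvement))
```

Each point pairs a similarity value for one source-target combination with the improvement of the corresponding domain-specific model over the original BERT.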
The correlation results are visualized in Figure 2. JSD has the strongest correlation (0.519) with the improvement due to domain-specific models, while the other two measures also have modest correlation (0.481 for PPL and 0.436 for TVC). Recall that the calculation of JSD takes unigrams, bigrams and trigrams into consideration, whereas PPL considers trigrams only and TVC considers unigrams only. Correlations between different measures indicate that these measures are able to reach agreement on whether source and target are similar. We find no correlation between the TTR of source data and the improvement.

Summary
We conduct a case study of pretraining BERT on social media text. Through extensive experiments, we show the importance of selecting in-domain source data. Based on empirical analysis, we recommend measures to help select pretraining data for best performance on new applications.