To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on Resource-Rich Tasks

Pretraining NLP models with variants of the Masked Language Model (MLM) objective has recently led to significant improvements on many tasks. This paper examines the benefits of pretrained models as a function of the number of training samples used in the downstream task. On several text classification tasks, we show that as the number of training examples grows into the millions, the accuracy gap between finetuning a BERT-based model and training a vanilla LSTM from scratch narrows to within 1%. Our findings indicate that MLM-based models may reach a point of diminishing returns as the supervised data size increases significantly.


Introduction
Language modeling has emerged as an effective pretraining approach for a wide variety of NLP models. Multiple techniques have been proposed, including bi-directional language modeling (Peters et al., 2018), masked language models (Devlin et al., 2018), and variants of denoising auto-encoder approaches (Raffel et al., 2019). Today, it is rare to examine a leaderboard (e.g., https://super.gluebenchmark.com/leaderboard) without finding the top spots occupied by some variant of a pretraining method. The future of NLP appears to be paved by pretraining a universal contextual representation on Wikipedia-like data at massive scale. Attempts along this path have pushed the pretraining data up to 10× the size of Wikipedia (Raffel et al., 2019). However, the success of these experiments is mixed: although improvements have been observed, the downstream tasks are usually data-limited. There is evidence that large-scale pretraining does not always lead to state-of-the-art results (Raffel et al., 2019), especially on tasks such as machine translation, where the abundance of training data and the existence of strong augmentation methods such as back-translation may have limited the benefit of pretraining.
This paper examines the benefits of pretraining as the number of training samples for the downstream task increases. To answer this question, we focus on multi-class text classification since: (i) it is one of the most important problems in NLP, with applications spanning multiple domains; (ii) large amounts of training data exist for many text classification tasks, or can be obtained relatively cheaply through crowd workers (Snow et al., 2008). We choose three sentiment classification datasets: Yelp reviews (yel, 2019) and Amazon sports and electronics reviews (Ni et al., 2019), ranging in size from 6 to 18 million examples. We finetune a RoBERTa model on increments of each downstream dataset and evaluate the performance at each increment. For example, on the Yelp dataset, whose size is 6 million, we train the models on subsets of the data with subset sizes following the sequence (60K, 600K, 1.8M, 3M, ..., 6M). For comparison, we also train a vanilla BiLSTM, and another BiLSTM that uses pretrained RoBERTa token embeddings. We observe that when both models are trained on 1% of the data, the gap between the BiLSTM and RoBERTa models is at its peak, but as the training dataset size increases, the BiLSTM's accuracy keeps increasing whereas RoBERTa's accuracy remains mostly flat. As the dataset size increases, the accuracy gap shrinks to within 1%.
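The incremental-training protocol can be sketched as follows. This is a minimal illustration, not the paper's exact code; the nesting of subsets and the random seed are our assumptions.

```python
import random

def training_subsets(dataset, fractions=(0.01, 0.1, 0.3, 0.5, 0.7, 0.9), seed=0):
    """Yield random subsets of the training set, one per fraction."""
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    for frac in fractions:
        n = round(len(dataset) * frac)
        # Subsets are nested: each larger subset contains the smaller ones,
        # so accuracy curves across increments are comparable.
        yield [dataset[i] for i in indices[:n]]

# Toy example on 6M "examples" (here just ids), mirroring the Yelp setup.
sizes = [len(s) for s in training_subsets(range(6_000_000))]
print(sizes)  # [60000, 600000, 1800000, 3000000, 4200000, 5400000]
```

Each model is then trained from the same initialization on every subset in turn, and accuracy is recorded at each increment.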
Our study suggests that collecting data and training directly on the target task is a solution worth considering, especially in production environments, where accuracy is not the only factor; inference latency is often just as crucial. We benchmarked the inference latency of these models on both CPU and GPU for different batch sizes and, as expected, observe at least a 20× speedup for the BiLSTM compared to RoBERTa. This paper provides new experimental evidence and discussion for rethinking the MLM pretraining paradigm in NLP, at least for resource-rich tasks.

Related Work
Scaling the number of training examples has long been identified as a source of improvement for machine learning models in multiple domains, including NLP (Banko and Brill, 2001), computer vision (Deng et al., 2009; Sun et al., 2017), and speech (Amodei et al., 2016). Previous work has suggested that deep learning scaling may be predictable empirically (Hestness et al., 2017), with model size scaling sub-linearly with training data size. Sun et al. (2017) concluded that accuracy increases logarithmically with training data size. However, these studies focused on training models in the fully supervised setting, without pretraining.
The closest work to ours is He et al. (2019), which shows that randomly initialized standard computer-vision models perform no worse than their ImageNet-pretrained counterparts. Our work, by contrast, focuses on text classification. We do not examine the benefit of pretraining at large; rather, we focus on the benefit of pretraining for resource-rich tasks. Concurrent work by Nakkiran and Sutskever (2020), still under review, observes that on some translation tasks such as IWSLT'14, small language models exhibit even lower test loss than large transformer models as the number of training samples increases.

Task and Data
We focus on a multi-class sentiment classification task: given a user review, predict the rating on a five-point scale {1, 2, 3, 4, 5}. The experiments are conducted on the following three benchmark datasets.
• Yelp Challenge (yel, 2019) contains text reviews, tips, business information, and check-in data from Yelp. We use the 6.7M user reviews with ratings as our dataset.
• Amazon Reviews (Ni et al., 2019) contains product reviews (ratings, text, helpfulness votes) from Amazon. We choose two categories, sports/outdoors and electronics, as two separate datasets, and use only the review text as input features.
The rating distribution of each dataset is shown in Table 1. In our experiments, each dataset is split into 90% for training and 10% for testing.

Models
We compare the following three types of pretrained and vanilla models:
• RoBERTa A transformer-based model pretrained with a masked language modeling objective on a large corpus. We finetune both RoBERTa-Base (12 layers, 768 hidden units, 12 heads) and RoBERTa-Large (24 layers, 1024 hidden units, 16 heads) on our classification task.
• LSTM (Hochreiter and Schmidhuber, 1997) We use a bidirectional LSTM with a max-pooling layer on top of the hidden states, followed by a linear layer. Token embeddings of size 128 are randomly initialized.
• LSTM + Pretrained Token Embedding Similar to the previous setup, except that we initialize the token embeddings with RoBERTa's pretrained token embeddings (Base: 768-dimensional, Large: 1024-dimensional). The embeddings are frozen during training.
For a fair comparison, all the above models share the same vocabulary and BPE tokenizer (Sennrich et al., 2015).
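The LSTM variants described above can be sketched in PyTorch as follows. The hidden size and layer count here are illustrative assumptions, not the paper's exact configuration; the frozen-embedding path corresponds to the pretrained-token-embedding variant.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM + max-pooling over time + linear output layer.

    Hidden size is an assumption for illustration; pass a RoBERTa token
    embedding matrix as `pretrained_embeddings` for the pretrained variant.
    """
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256,
                 num_layers=4, num_classes=5, pretrained_embeddings=None):
        super().__init__()
        if pretrained_embeddings is not None:
            # Initialize from the pretrained token embedding matrix and
            # freeze it, as described in the text.
            self.embed = nn.Embedding.from_pretrained(pretrained_embeddings,
                                                      freeze=True)
            embed_dim = pretrained_embeddings.size(1)
        else:
            self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq, embed_dim)
        h, _ = self.lstm(x)         # (batch, seq, 2 * hidden_dim)
        pooled, _ = h.max(dim=1)    # max-pool over the time dimension
        return self.out(pooled)     # (batch, num_classes) logits
```

A forward pass on a batch of token-id sequences yields one 5-way logit vector per review.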

Experimental Setup
We use the Adam optimizer and the following hyperparameter sweep for each model. (i) RoBERTa is finetuned with learning rates {5e-6, 1e-5, 1.5e-5, 2e-5}, with linear warmup over the first 5% of steps followed by a linear decay.
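The learning-rate schedule above can be sketched as a simple function of the step count. Decaying linearly to zero is our assumption, since the original sentence is truncated after "linear".

```python
def linear_warmup_decay(step, total_steps, peak_lr, warmup_frac=0.05):
    """Linear warmup over the first 5% of steps, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr during warmup.
        return peak_lr * step / warmup_steps
    # Decay from peak_lr at the end of warmup to 0 at the final step.
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

# Example with peak learning rate 1e-5 over 1000 steps:
print(linear_warmup_decay(25, 1000, 1e-5))    # halfway through warmup
print(linear_warmup_decay(1000, 1000, 1e-5))  # end of training -> 0.0
```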

Impact of Data Size
We first investigate the effect of varying the number of training samples for a fixed model and training procedure. We train the different models on {1%, 10%, 30%, 50%, 70%, 90%} of the data to mimic the "low-resource", "medium-resource", and "high-resource" regimes. Figure 1 shows the accuracy delta between the LSTM and RoBERTa models at different percentages of the training data. From the plot, we observe the following phenomena. (i) Pretrained models exhibit diminishing returns as the size of the target data grows: as the number of training examples increases, the accuracy gap between RoBERTa and the LSTM shrinks. For example, when both models are trained with 1% of the Yelp dataset, the accuracy gap is around 9%; as we increase the amount of training data to 90%, the gap drops to within 2%. The same behavior is observed on both Amazon review datasets, with the initial gap starting at almost 5% for 1% of the training data and shrinking to within one point when most of the training data is used.
(ii) Using the pretrained RoBERTa token embeddings further reduces the accuracy gap, especially when training data is limited. For example, on the Yelp review data, a 4-layer LSTM with pretrained embeddings gains an additional 3 percent over its randomly initialized counterpart. As Table 2 shows, an LSTM with pretrained RoBERTa token embeddings always outperforms one with random token initialization. This suggests that the embeddings learned during RoBERTa pretraining may constitute an efficient approach for transferring the knowledge learned by these large MLMs.
We further report the accuracy of each model using all the training data; the full results are listed in Table 2. We observe that the accuracy gap is less than 1% on the Amazon datasets, even compared to the 24-layer RoBERTa-Large model. On the Yelp dataset, the accuracy gap is within 2 percent of the RoBERTa-Large model, despite an order-of-magnitude difference in the number of parameters.

Inference Time
We also investigate the inference time of the three types of models on GPU and CPU. The CPU inference time is tested on an Intel Xeon E5-2698 v4 with batch size 128. The GPU inference time is tested on an NVIDIA Quadro P100 with batch size ∈ {128, 256, 384}. The maximum sequence length is 512. We run each setting 30 times and report the average. The results are listed in Table 3. Not surprisingly, the LSTM model is at least 20 times faster, even compared to RoBERTa-Base. Note that the P100 runs out of memory at batch size 384 with RoBERTa-Large. Another observation is that although using the RoBERTa pretrained token embeddings introduces 10 times more model parameters than the vanilla BiLSTM, the inference time increases by less than 25%, because most of the additional parameters enter through a simple linear transformation.
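The timing protocol can be sketched with a generic harness; the model call below is a stand-in, and the warmup passes are our assumption, while the 30 repetitions and averaging follow the text.

```python
import time

def mean_latency(run_inference, batch, n_runs=30, warmup=3):
    """Average wall-clock latency of run_inference(batch) over n_runs,
    after a few warmup calls to exclude one-time setup costs."""
    for _ in range(warmup):
        run_inference(batch)
    start = time.perf_counter()
    for _ in range(n_runs):
        run_inference(batch)
    return (time.perf_counter() - start) / n_runs

# Toy stand-in for a model forward pass on a batch of 128 sequences of length 512.
fake_model = lambda batch: [sum(seq) for seq in batch]
batch = [[1] * 512 for _ in range(128)]
print(f"{mean_latency(fake_model, batch) * 1000:.3f} ms/batch")
```

On GPU, one would additionally synchronize the device (e.g., `torch.cuda.synchronize()`) before reading the clock, since kernel launches are asynchronous.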

Discussion
Our findings indicate that increasing the number of training examples for 'standard' models such as the LSTM leads to performance within 1 percent of their massively pretrained counterparts. Because there is no comparably large-scale question answering dataset, it is not clear whether the same findings would hold on that type of NLP task, which is more challenging and more semantic in nature. In future work, we will run more experiments as other large-scale open datasets become available. Although sentiment analysis is a crucial text classification task, it is possible, though unlikely, that the patterns observed here are limited to sentiment analysis alone. The rationale is that pretrained LSTMs have kept up well with their transformer-based counterparts on many tasks (Radford et al.).
One way to interpret our results is that 'simple' models have a better regularization effect when trained on large amounts of data, as also evidenced in concurrent work (Nakkiran and Sutskever, 2020). The other side of the argument is that MLM-based pretraining still leads to improvements even as the data size scales into the millions: with a pretrained model and 2 million training examples, it is possible to outperform an LSTM model trained on 3× more examples.

Conclusion
Finetuning BERT-style models on resource-rich downstream tasks is not well studied. In this paper, we showed that when the downstream task has a sufficiently large number of training examples, i.e., millions, competitive accuracy can be achieved by training a simple LSTM, at least for text classification tasks. We further discovered that reusing the token embeddings learned during BERT pretraining in an LSTM model leads to significant improvements. These findings have implications for both practical applications and research on pretraining. For industrial applications, where there is typically a trade-off between accuracy and latency, our findings suggest it may be feasible to close the accuracy gap of faster models by collecting more training examples.