On the importance of pre-training data volume for compact language models

Recent advances in language modeling have led to computationally intensive and resource-demanding state-of-the-art models. In an effort towards sustainable practices, we study the impact of pre-training data volume on compact language models. Multiple BERT-based models are trained on gradually increasing amounts of French text. Through fine-tuning on the French Question Answering Dataset (FQuAD), we observe that well-performing models are obtained with as little as 100 MB of text. In addition, we show that past critically low amounts of pre-training data, an intermediate pre-training step on the task-specific corpus does not yield substantial improvements.


Introduction
Over the past year, pre-trained language models have become the norm in Natural Language Processing. These large-scale Transformer-based (Vaswani et al., 2017) networks considerably advanced the state-of-the-art in language understanding (Devlin et al., 2019) via a two-step process: self-supervised learning on a vast text corpus followed by fine-tuning on a specific downstream task.
Following these advances, the ongoing trend has been to build bigger models with an ever-increasing amount of data (Liu et al., 2019; Raffel et al., 2020; Radford et al., 2019; Brown et al., 2020). However, pre-training models with billions of parameters over hundreds of gigabytes of text requires tremendous computational resources that only a few companies and institutions can afford. Besides, many languages and specific corpora (e.g. legal, scientific) are currently under-resourced. Hence, our goal is to explore model architectures and data volumes lowering the entry barrier to new research and practical applications.
We conduct experiments on French corpora in order to release the first French compact language models and to illustrate the training process in a language other than English. Furthermore, we consider the question answering task since compact models may find their purpose in low-latency, fault-tolerant information retrieval systems.

Problem statement
We intend to study the impact of pre-training data volume when training compact bidirectional Transformers (Devlin et al., 2019). We assume a scarce-resources setting, both in terms of data and computing power. Two key aspects are explored:
• The amount of pre-training data required to train high-performing compact language models.
• The importance of corpus-specific masked language modeling (MLM) before fine-tuning.
We use the French part of the OSCAR corpora (Ortiz Suárez et al., 2019) for pre-training and the FQuAD dataset (d'Hoffschmidt et al., 2020) for machine reading comprehension fine-tuning. Moreover, the models under consideration are based on the CamemBERT (Martin et al., 2020) language model.

Related work
A wealth of work has recently been released (Ganesh et al., 2020) on compressing Transformer-based models (Vaswani et al., 2017; Devlin et al., 2019) through the pre-training of compact models (Turc et al., 2019), distillation (Hinton et al., 2015; Jiao et al., 2019; Sun et al., 2020), pruning (McCarley et al., 2019; Sanh et al., 2020; Fan et al., 2020a) and quantization (Shen et al., 2019; Fan et al., 2020b). Nevertheless, absolute performance is not the end goal of this study. Rather, we investigate the training process of compact models in the absence of larger ones to distill or prune. Furthermore, Sanh et al. (2020) acknowledge the difficulty of speeding up sparse models due to the absence of specialized hardware. Therefore, from an inference speed standpoint, it is currently preferable to train compact models.
Language models have been successfully pre-trained on domain-specific corpora (Beltagy et al., 2019) and outperform their general-purpose counterparts on targeted downstream tasks. Still, training these models involved large datasets and computational resources out of reach for most. Multilingual models (Devlin et al., 2019; Lample and Conneau, 2019; Conneau et al., 2020) have been released to alleviate the need for language-specific pre-training. While they offer competitive results, they usually lag behind monolingual models and require larger architectures. Martin et al. (2020) observed that large models did not improve on evaluation tasks when increasing the amount of pre-training data from 4 GB to 138 GB. They left as future work the question of whether large-scale pre-training corpora are necessary for other model architectures and tasks.

OSCAR
OSCAR (https://oscar-corpus.com/; Ortiz Suárez et al., 2019) is a large-scale multilingual open-source collection of corpora obtained by language classification and filtering of the Common Crawl corpus (https://commoncrawl.org/about/). The whole French part amounts to 138 GB of text and has already been used to train French language models (Martin et al., 2020). In this work, we only extract a sample of 4 GB of shuffled lines.

FQuAD
FQuAD (d'Hoffschmidt et al., 2020) is a recently introduced open-source French native reading comprehension dataset. It consists of 60,000 questions and answers gathered from a set of 1,769 high-quality Wikipedia articles. In many aspects, it is the French equivalent of SQuAD 1.1 (Rajpurkar et al., 2016). Given a question and a paragraph, the task consists in extracting from the paragraph the span of text that answers the question.
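To make the span extraction setup concrete, the following minimal sketch illustrates the SQuAD 1.1-style format shared by FQuAD; the question/context pair and character offset are invented for illustration and are not taken from the dataset.

```python
# Minimal illustration of the extractive QA format shared by FQuAD and SQuAD 1.1.
# The question/context pair below is invented for illustration purposes only.
example = {
    "context": "Victor Hugo est un écrivain français né en 1802 à Besançon.",
    "question": "Où est né Victor Hugo ?",
    "answers": {"text": ["Besançon"], "answer_start": [50]},
}

# A model predicts start/end positions; the answer is recovered by slicing
# the paragraph with the predicted span boundaries.
start = example["answers"]["answer_start"][0]
end = start + len(example["answers"]["text"][0])
assert example["context"][start:end] == example["answers"]["text"][0]
```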
We chose FQuAD as the fine-tuning dataset because it allows one to draw a direct parallel with its English counterpart (d'Hoffschmidt et al., 2020) and because it is one of the largest annotated French datasets. However, question answering is a notoriously difficult task for compact models (McCarley et al., 2019). While distillation has been shown to substantially improve their results on the GLUE benchmark (Wang et al., 2018), machine reading comprehension remains difficult to speed up without incurring a significant drop in accuracy.

CamemBERT SMALL
CamemBERT (Martin et al., 2020) is a multi-layer bidirectional Transformer (Vaswani et al., 2017) with two architectures: base (12 layers, 768 hidden dimensions, 12 attention heads, 110M parameters) and large (24 layers, 1024 hidden dimensions, 16 attention heads, 355M parameters). It is very similar to RoBERTa (Liu et al., 2019). The main differences are the use of whole-word masking and SentencePiece tokenization (Kudo and Richardson, 2018) instead of subword masking and byte-level Byte-Pair Encoding (Sennrich et al., 2016; Radford et al., 2019). RoBERTa itself improves upon BERT by aggregating several modifications on top of the original architecture, such as removing the next sentence prediction task, dynamic masking, and training with larger batches on more data.
We introduce CamemBERT SMALL, a CamemBERT-based language model with a small architecture (12 layers, 256 hidden dimensions, 4 attention heads, 17M parameters). The main difference with the original CamemBERT lies in the use of subword masking. Indeed, the authors later found that whole-word masking had at best a marginal impact on downstream task performance. Apart from inference speed and size considerations, this architectural choice is mainly motivated by the following:
• This is the same architecture as ELECTRA SMALL++ (Clark et al., 2020), a recently released compact language model. Even though ELECTRA and CamemBERT differ in many regards (ELECTRA being trained as a discriminator rather than a generator), prior experiments conducted by Clark et al. (2020) give us an acceptable set of hyperparameters when pre-training and fine-tuning the model.
Table 1 shows that CamemBERT SMALL is much smaller and faster than its larger siblings. In a plausible setup for question answering systems, it provides, respectively, a 4.5-fold and 15-fold inference speed-up compared to CamemBERT BASE and CamemBERT LARGE while being 6.2 and 18.8 times smaller.
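The architecture above can be reproduced approximately with Hugging Face's Transformers library. The sketch below is indicative only: the vocabulary size, intermediate size and position embedding count are assumptions on our part rather than the released configuration.

```python
# A sketch of the small architecture described above (assumed values are marked).
from transformers import CamembertConfig, CamembertForMaskedLM

config = CamembertConfig(
    vocab_size=32005,             # CamemBERT SentencePiece vocabulary size (assumption)
    num_hidden_layers=12,         # 12 layers
    hidden_size=256,              # 256 hidden dimensions
    num_attention_heads=4,        # 4 attention heads
    intermediate_size=1024,       # 4 x hidden size, as in ELECTRA SMALL (assumption)
    max_position_embeddings=514,  # RoBERTa-style positions (assumption)
)
model = CamembertForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # roughly in line with 17M
```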

Experiments
Six overlapping subsets are built from the 4 GB OSCAR sample. They are denoted as OSC 10, OSC 100, OSC 500, OSC 1000, OSC 2000 and OSC 4000 (the numbers indicating their size in MB). We extract an additional 10 MB sample from the corpus, which serves as a validation set for the self-supervised pre-training task. FQuAD, on the other hand, consists of a train/dev split of 50,741 and 5,668 question/context pairs.
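As a rough illustration of how such nested subsets can be carved out of the shuffled 4 GB sample, the sketch below writes each line to every subset whose size budget is not yet exhausted; the file names are placeholders and this is not the authors' released tooling.

```python
# Build overlapping subsets of the shuffled OSCAR sample: each smaller subset is
# a prefix of the larger ones. File names are placeholders.
sizes_mb = [10, 100, 500, 1000, 2000, 4000]

with open("oscar_fr_4gb_shuffled.txt", encoding="utf-8") as src:
    outputs = {s: open(f"osc_{s}.txt", "w", encoding="utf-8") for s in sizes_mb}
    written = 0
    for line in src:
        for size, out in outputs.items():
            if written < size * 1_000_000:
                out.write(line)
        written += len(line.encode("utf-8"))
    for out in outputs.values():
        out.close()
```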
For each OSCAR subset, we pre-train a CamemBERT SMALL model with the standard masked language modeling (MLM) objective. Then we fine-tune the pre-trained models on the question answering task with the same span prediction method as BERT (Devlin et al., 2019). Between those two steps, an optional MLM step over the FQuAD train set is included. Table 2 shows the pre-training, intermediate MLM (if any) and fine-tuning hyperparameters. As fine-tuning is a brittle process (Dodge et al., 2020), results are averaged over 3 seeds.
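A minimal sketch of the MLM pre-training step with the Transformers Trainer API is given below; the tokenizer reuse, data file and hyperparameter values are placeholders and do not correspond to Table 2.

```python
# Sketch of MLM pre-training on an OSCAR subset (placeholder paths and values).
from datasets import load_dataset
from transformers import (CamembertConfig, CamembertForMaskedLM, CamembertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = CamembertTokenizerFast.from_pretrained("camembert-base")  # assumed tokenizer
model = CamembertForMaskedLM(CamembertConfig(
    vocab_size=len(tokenizer), num_hidden_layers=12, hidden_size=256,
    num_attention_heads=4, intermediate_size=1024))

dataset = load_dataset("text", data_files={"train": "osc_100.txt"})["train"]
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
                      batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="mlm_osc100", max_steps=200_000,
                         per_device_train_batch_size=64, learning_rate=5e-4)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```

The optional intermediate MLM step would simply repeat this loop on the FQuAD training contexts before the span prediction fine-tuning.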
The experiments described were implemented using Hugging Face's Transformers library (Wolf et al., 2019) and were conducted on an NVIDIA V100 16 GB GPU. Martin et al. (2020) observed that complex downstream tasks may require more pre-training steps. Since for each OSCAR subset the validation loss is still slowly decreasing after 200k steps, we assume that training longer might increase performance on the difficult question answering task. On the other hand, corpus-specific MLM fine-tuning converged quickly for all models. Table 3 reports the entirety of the results.
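For completeness, the sketch below shows how the BERT-style span prediction head extracts an answer at inference time once a model has been fine-tuned on FQuAD; the checkpoint path and the example pair are placeholders.

```python
# Sketch of answer extraction with a fine-tuned span prediction head
# (placeholder checkpoint path and example).
import torch
from transformers import CamembertForQuestionAnswering, CamembertTokenizerFast

tokenizer = CamembertTokenizerFast.from_pretrained("camembert-base")
model = CamembertForQuestionAnswering.from_pretrained("qa_osc100")  # fine-tuned checkpoint (assumed)

enc = tokenizer("Où est né Victor Hugo ?",
                "Victor Hugo est né en 1802 à Besançon.",
                return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# The predicted answer is the span between the argmax of the start and end logits.
start = out.start_logits.argmax(-1).item()
end = out.end_logits.argmax(-1).item()
print(tokenizer.decode(enc["input_ids"][0][start:end + 1], skip_special_tokens=True))
```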

Analysis
How much data does one need to pre-train a compact language model?
As we increase the amount of pre-training data, perplexity on the OSCAR dev set decreases in every instance but one (OSC 4000). Nevertheless, aside from OSC 10, discrepancies are small and the models show almost identical learning curves. OSC 10 underperforms in terms of MLM perplexity and question answering F1 score when compared to larger subsets. However, past this smallest dataset, pre-training data volume does not exhibit any strong monotonic relationship with downstream performance. The only OSCAR subset displaying a noticeable performance gap is OSC 2000, with a +2.46 average F1 score increase over OSC 100. For anchoring, a randomly initialized CamemBERT SMALL model "fine-tuned" directly on the FQuAD train set achieves a far lower F1 score, while the base and large models of Martin et al. (2020) obtain F1 scores of 88 and 92, respectively, after fine-tuning. Due to computational constraints, we could not investigate smaller or larger datasets, nor a prolonged pre-training phase. It could be the case that for a 200k pre-training steps budget, data volume is not the bottleneck. In fact, additional training steps may be even more beneficial for larger datasets. Nonetheless, a preliminary experiment pushing the pre-training phase of CamemBERT SMALL on OSC 2000 to 300k steps revealed that while the MLM loss decreased, the F1 score on the downstream task did not improve.

Is corpus-specific MLM beneficial?
Again, we observe a contrast between OSC 10 and the larger subsets. OSC 10 is the only pre-training dataset for which the intermediate MLM step significantly improves the downstream task (+4.15 F1) and decreases perplexity on both pre-training and fine-tuning data. However, this corpus-specific MLM step is not truly intermediate since the FQuAD contexts contain 10 MB of raw text: it amounts to a 2-fold increase in pre-training data rather than a specific domain adaptation step. Therefore, we turn our focus to larger subsets for the rest of this analysis.
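For reference, the corpus-specific MLM data can be obtained by dumping the unique paragraph contexts of the FQuAD training file, which follows the SQuAD 1.1 JSON layout; the file names below are assumptions.

```python
# Dump the unique FQuAD training contexts to a raw text file for the
# intermediate MLM step (file names are assumptions).
import json

with open("fquad_train.json", encoding="utf-8") as f:
    articles = json.load(f)["data"]

contexts = {p["context"] for article in articles for p in article["paragraphs"]}
with open("fquad_contexts.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(contexts)))  # roughly 10 MB of deduplicated raw text
```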
In these cases, MLM fine-tuning results in a net FQuAD perplexity decrease at the cost of an OSCAR perplexity increase. Domain shift may be the root cause of this trade-off: as there exists a mismatch between the pre-training and fine-tuning sets, the language model has to adapt to the specificity of descriptive paragraphs. In addition, perplexity is higher on the OSCAR dev set than on the FQuAD one. This is most likely due to the difficulty of predicting masked words in a heterogeneous web-crawled dataset compared to a set of high-quality Wikipedia articles.
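The perplexities discussed here can be estimated as the exponential of the average masked-token cross-entropy over a held-out set, along the lines of the sketch below; averaging per-batch losses is a simplification of exact token-weighted averaging.

```python
# Estimate MLM perplexity as exp(mean masked-token loss) over a held-out set.
import math
import torch
from torch.utils.data import DataLoader

def mlm_perplexity(model, dataset, collator, batch_size=32, device="cuda"):
    model.eval().to(device)
    loader = DataLoader(dataset, batch_size=batch_size, collate_fn=collator)
    losses = []
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            losses.append(model(**batch).loss.item())
    # Simplification: batch-level averaging instead of exact token-weighted averaging.
    return math.exp(sum(losses) / len(losses))
```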
For every pre-training subset but one (OSC 2000), MLM fine-tuning induced a slight F1 score increase on the downstream task. However, these gains are marginal, with at most a +0.75 average F1 score increase in the case of OSC 500. Additional experiments are required to consolidate these findings, especially on larger task-specific corpora such as scientific or legal text. In those instances, a greater domain shift would probably justify an intermediate MLM fine-tuning step.

Conclusion
We investigated the importance of pre-training data volume when training compact Transformer-based models. We observed that 100 MB of raw text is sufficient to reach performance similar to that obtained with larger datasets on a question answering task, and that corpus-specific self-supervised learning does not bring significant improvements on this particular problem. These preliminary results pave the way for further experiments with other language models, various architectures and new downstream tasks.