How Many Data Points is a Prompt Worth?

When fine-tuning pretrained models for classification, researchers either use a generic model head or a task-specific prompt for prediction. Proponents of prompting have argued that prompts provide a method for injecting task-specific guidance, which is beneficial in low-data regimes. We aim to quantify this benefit through rigorous testing of prompts in a fair setting: comparing prompted and head-based fine-tuning in equal conditions across many tasks and data sizes. By controlling for many sources of advantage, we find that prompting does indeed provide a benefit, and that this benefit can be quantified per task. Results show that prompting is often worth hundreds of data points on average across classification tasks.


Introduction
The main paradigm for adapting pretrained models for classification (Radford, 2018; Dong et al., 2019; Devlin et al., 2018) is fine-tuning via an explicit classifier head. However, an alternative approach has arisen: adapting the pretrained language model directly as a predictor through autoregressive text generation (Radford et al., 2019) or completion of a cloze task (Trinh and Le, 2018). This method is notably used in T5 fine-tuning (Raffel et al., 2019), leading to state-of-the-art results on the SuperGLUE benchmark.
One argument made for classification by direct language generation is that it allows us to pick custom prompts for each task (McCann et al., 2018). While this approach can be used for zero-shot classification (Puri and Catanzaro, 2019) or priming (Brown et al., 2020), it can also be used in fine-tuning to provide extra task information to the classifier, especially in the low-data regime (Schick and Schütze, 2020a,b).
Code available at https://github.com/TevenLeScao/pet
If this argument is indeed true, it is natural to ask how it impacts the sample efficiency of the model, or more directly, how many data points is a prompt worth? As with many low-data and pretraining-based problems, this question is complicated by the fine-tuning setup, training procedure, and prompts themselves. We attempt to isolate these variables through diverse prompts, multiple runs, and best practices in low-training data fine-tuning. We introduce a metric, the average data advantage, for quantifying the impact of a prompt in practice.
Our experiments find that the impact of task-targeted prompting can nicely be quantified in terms of direct training data, and that it varies over the nature of different tasks. On MNLI (Williams et al., 2018), we find that using a prompt contributes approximately 3500 data points. On SuperGLUE, it adds approximately 280 data points on RTE (Dagan et al., 2005) and up to 750 on BoolQ (Clark et al., 2019). In low- to medium-data settings, this advantage can be a real contribution to training a model.

Related Work
Prompting has been used both for zero-shot and fine-tuning based methods. Zero-shot approaches attempt to answer a task with a prompt without fine-tuning, through generation (Radford et al., 2019). GPT-3 (Brown et al., 2020) extends this approach to a supervised priming method by taking in training examples as priming at inference time, so it can attend to them while answering. T5 (Raffel et al., 2019) and other sequence-to-sequence pretrained models use standard word-based fine-tuning with a marker prompt to answer classification tasks with strong empirical success. Our setting differs in that we are interested in using task-based prompts and fine-tuning, in between the T5 and GPT-2 settings.
Our setting most closely resembles PET (Schick and Schütze, 2020a,b), which claims that task-specific prompting helps transfer learning, especially in the low-data regime. However, in order to reach the best possible results on SuperGLUE, PET introduces several other extensions: semi-supervision via additional pseudo-labeled data, ensembling models trained with several different prompts, and finally distilling the ensemble into a linear classifier rather than a language model. Our aim is to isolate the specific contributions of prompting within supervised fine-tuning.
Finally, recent papers have experimented with discovering prompts through automated processes tailored to the language model (Jiang et al., 2020; Schick et al., 2020). We limit ourselves to human-written prompts, as we are interested in whether prompting itself specifically adds information to the supervised task. It is an interesting question whether automatic prompts can have this same impact (relative to the training data they require).

Comparison: Heads vs Prompts
Consider two transfer learning settings for text classification: head-based, where a generic head layer takes in pretrained representations to predict an output class; and prompt-based, where a task-specific pattern string is designed to coax the model into producing a textual output corresponding to a given class. Both can be utilized for fine-tuning with supervised training data, but prompts further allow the user to customize patterns to help the model.
For the prompt model we follow the notation from PET (Schick and Schütze, 2020a) and decompose a prompt into a pattern and a verbalizer. The pattern turns the input text into a cloze task, i.e. a sequence with a masked token or tokens that need to be filled. Let us use as an example an excerpt from the SuperGLUE task BoolQ (Clark et al., 2019), in which the model must answer yes-or-no binary questions. To let a language model answer the question (here, "can u marry a dead person in france"), the pattern (Schick and Schütze, 2020b) adds the text "Based on the previous passage," before the question and "? <MASK>" after it:
"Posthumous marriage - Posthumous marriage (or necrogamy) is a marriage in which one of the participating members is deceased. It is legal in France and similar forms are practiced in Sudan and China. Since World War I, France has had hundreds of requests each year, of which many have been accepted. Based on the previous passage, can u marry a dead person in france ? <MASK>"
The masked word prediction is mapped to a verbalizer, which produces a class (here, "Yes": True and "No": False). Several pattern-verbalizer pairs (PVPs) could be used for a single task, differing either through the pattern, the verbalizer, or both. Fine-tuning is done by training the model to produce the correct verbalization. The loss is the cross-entropy loss between the correct answer and the distribution of probabilities amongst the tokens in the verbalizer. We re-use pattern choices from Schick and Schütze (2020b); examples are available in Appendix A.
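As an illustration of the pattern-verbalizer decomposition and the loss described above, here is a minimal sketch in plain Python. The function names (`boolq_pattern`, `verbalizer_cross_entropy`) and the dictionary of logits are hypothetical and chosen only for this example; in practice the logits would come from the masked language model's output at the `<MASK>` position. The sketch assumes each verbalizer word is a single token.

```python
import math

MASK = "<MASK>"

# Verbalizer from the BoolQ example: predicted word -> class.
VERBALIZER = {"Yes": True, "No": False}


def boolq_pattern(passage: str, question: str) -> str:
    """Turn a (passage, question) pair into a cloze-style input."""
    return f"{passage} Based on the previous passage, {question} ? {MASK}"


def verbalizer_cross_entropy(token_logits: dict, correct_token: str) -> float:
    """Cross-entropy over the verbalizer tokens only.

    token_logits holds the model's logit at the masked position for each
    verbalizer token (hypothetical values in this sketch). The softmax is
    restricted to these tokens, as described in the text.
    """
    m = max(token_logits.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in token_logits.values()))
    return log_z - token_logits[correct_token]


if __name__ == "__main__":
    text = boolq_pattern(
        "Posthumous marriage is legal in France.",
        "can u marry a dead person in france",
    )
    print(text)
    # Made-up logits for the two verbalizer tokens at the masked position.
    print(verbalizer_cross_entropy({"Yes": 2.0, "No": 0.0}, "Yes"))
```

Restricting the softmax to the verbalizer tokens keeps the loss comparable to that of a classifier head with the same number of classes.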

Experimental Setting
We run all experiments with the same pretrained checkpoint, roberta-large (355M parameters) from RoBERTa, which we load from the transformers (Wolf et al., 2020) library. In line with previous observations (McCoy et al., 2019; Dodge et al., 2020; Lee et al., 2020), head-based fine-tuning performance varies considerably. We follow recommendations of Mosbach et al. (2020) and Zhang et al. (2020) to train at a low learning rate (10^-5) for a large number of steps (always at least 250, possibly for over 100 epochs).
We perform our evaluation on SuperGLUE and MNLI (Williams et al., 2018). These datasets comprise a variety of tasks, all in English, including entailment (MNLI, RTE (Dagan et al., 2005), CB (de Marneffe et al., 2019)), multiple choice question answering (BoolQ (Clark et al., 2019), MultiRC (Khashabi et al., 2018)), and commonsense reasoning (WSC (Levesque et al., 2012), COPA (Roemmele et al., 2011), WiC (Pilehvar and Camacho-Collados, 2018)). We do not include ReCoRD (Zhang et al., 2018) in our comparisons as there is no head model to compare with, since it is already a cloze task. Data sizes range from 250 data points for CB to 392,702 for MNLI. As test data is not publicly available for SuperGLUE tasks, we set aside part of training (from 50 for CB, COPA and MultiRC to 500 for BoolQ) to use for development, and evaluate on their original validation sets. For MNLI, we use the available matched validation and test sets.
Figure 1: Prompting vs head (classifier) performance across data scales, up to the full dataset, for six SuperGLUE tasks. Compares the best prompt and head performance at each level of training data across 4 runs. Highlighted region shows the accuracy difference of the models. Cross-hatch region highlights the lowest- and highest-accuracy matched region in the curves. The highlighted area in this region is used to estimate the data advantage.
We compare models across a scale of available data, starting with 10 data points and increasing exponentially (as high-data performance tends to saturate) to the full dataset. For example, for MultiRC, which has 969 data points initially, we start by reserving 50 data points for development. This leaves us with 919 training points, and we train models with 10, 15, 20, 32, 50, 70, 100, 150, 200, 320, 500, 750, and 919 training points. We run every experiment 4 times in order to reduce variance, for a total of 1892 training runs across all tasks. At every point, we report the best performance that has been achieved at that amount of data or lower. Full graphs are presented in Appendix B.
Figure 1 shows the main results comparing head- and prompt-based fine-tuning with the best-performing pattern on that task. Prompting enjoys a substantial advantage on every task, except for WiC, as is reported in previous results (Schick and Schütze, 2020b). Both approaches improve with more training data, but prompting remains better by a varying amount. Many tasks in SuperGLUE have relatively few data points, but we also see an advantage in large datasets like BoolQ and MNLI.
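The MultiRC data schedule described above can be reproduced with a small sketch. It assumes that the grid is the fixed list of sizes quoted in the text, truncated at the number of available training points, with the full training set appended as the last level. The text does not list grid values above 750, so this sketch is only meaningful for datasets of roughly that size; `BASE_GRID` and `training_sizes` are hypothetical names.

```python
# Grid of training-set sizes quoted in the text for MultiRC.
# Values beyond 750 are not given in the text, so none are assumed here.
BASE_GRID = [10, 15, 20, 32, 50, 70, 100, 150, 200, 320, 500, 750]


def training_sizes(n_available: int, grid=BASE_GRID) -> list:
    """Grid sizes strictly below n_available, followed by the full training set."""
    sizes = [n for n in grid if n < n_available]
    sizes.append(n_available)
    return sizes


if __name__ == "__main__":
    # MultiRC: 969 data points, 50 of which are reserved for development.
    print(training_sizes(969 - 50))
```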

Results
To quantify how many data points the prompt is worth, we first isolate the y-axis band between the lowest and highest accuracies at which the two curves match. The horizontal line at these points represents the advantage of prompting. We then take the integral in this region, i.e. the area between the linearly-interpolated curves, divided by the height of the band. The area has the dimension of a quantity of data points times the metric unit, so dividing by the performance range yields an advantage expressed as a number of data points. As low-data training is sensitive to noise, in addition to following best training practices we run several different experiments for each x-point. We use a bootstrapping approach to estimate confidence over these runs. Specifically, we hold out one of the 4 head runs and one of the 4 prompt runs (16 combinations total), and compute the standard deviation of those outcomes.
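The computation described above can be sketched as follows. This is an illustrative reading of the text, not the released implementation, and it rests on several assumptions:

- Curves are interpolated linearly in the raw number of data points (not in log scale).
- The band runs from the larger of the two curve minima to the smaller of the two curve maxima.
- The area is approximated with a trapezoidal rule over accuracy levels.
- Each held-out outcome uses the best remaining run at each data level, followed by a running maximum.
- The spread is the sample standard deviation.

All function names are hypothetical.

```python
import statistics
from itertools import accumulate


def data_needed(xs, ys, y):
    """Smallest training-set size at which the interpolated curve reaches accuracy y."""
    for i, yi in enumerate(ys):
        if yi >= y:
            if i == 0:
                return float(xs[0])
            x0, y0 = xs[i - 1], ys[i - 1]
            # ys[i - 1] < y <= yi here, so the denominator is strictly positive.
            return x0 + (y - y0) / (yi - y0) * (xs[i] - x0)
    return float(xs[-1])  # only reached if y exceeds every point of the curve


def average_advantage(xs, head_acc, prompt_acc, steps=1000):
    """Area between the two curves inside the shared accuracy band, divided by its height."""
    y_lo = max(min(head_acc), min(prompt_acc))
    y_hi = min(max(head_acc), max(prompt_acc))
    if y_hi <= y_lo:
        raise ValueError("the two curves share no accuracy band")
    h = (y_hi - y_lo) / steps
    gaps = []
    for k in range(steps + 1):
        y = min(y_lo + k * h, y_hi)  # clamp to avoid rounding past the band
        gaps.append(data_needed(xs, head_acc, y) - data_needed(xs, prompt_acc, y))
    area = h * (sum(gaps) - 0.5 * (gaps[0] + gaps[-1]))  # trapezoidal rule
    return area / (y_hi - y_lo)


def best_curve(runs):
    """Best accuracy over runs at each data level, then running maximum over levels."""
    return list(accumulate((max(level) for level in zip(*runs)), max))


def held_out_std(xs, head_runs, prompt_runs):
    """Standard deviation of the advantage over all (held-out head, held-out prompt) pairs."""
    outcomes = []
    for i in range(len(head_runs)):
        for j in range(len(prompt_runs)):
            head = best_curve(head_runs[:i] + head_runs[i + 1:])
            prompt = best_curve(prompt_runs[:j] + prompt_runs[j + 1:])
            outcomes.append(average_advantage(xs, head, prompt))
    return statistics.stdev(outcomes)


if __name__ == "__main__":
    # Toy curves (made-up numbers): the prompt model reaches 0.9 with 60 points,
    # the head model needs 110.
    xs = [10, 60, 110]
    print(average_advantage(xs, [0.5, 0.7, 0.9], [0.5, 0.9, 0.9]))
```

With 4 head runs and 4 prompt runs, `held_out_std` iterates over the 16 combinations mentioned in the text.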
We report these quantities for every task in Table 1 as Average advantage. For almost all tasks, we see that prompting gives a substantial advantage in terms of data efficiency, adding the equivalent of hundreds of data points on average.

Analysis
Table 1 notes: N indicates a null-verbalizer prompting task that replaces the verbalizer with a non-sensical mapping. *The comparison band of MultiRC is too small as the head baseline fails to learn beyond majority class; we use the full region for a lower-bound result.
Impact of Pattern vs Verbalizer The intuition of prompts is that they introduce a task description in natural language, even with few training points. To better understand the zero-shot versus adaptive nature of prompts, we consider a null verbalizer, a control with a verbalizer that cannot yield semantic information without training. For every task that requires filling in one word (which excludes the more free-form COPA and WSC), we replace the verbalizers, for example, "yes", "no", "maybe", "right" or "wrong", with random first names. Table 1 shows the advantage of the standard prompt over the null verbalizer to estimate this control. We see that for small data tasks such as CB, the null verbalizer removes much of the benefits of prompting. However, with more training data, the model seems to adapt the verbalizer while still gaining the inductive bias benefits of the pattern. Figure 2 showcases this dynamic on MNLI. This result further shows that prompting yields data efficiency even if it is not directly analogous to the generation process of training.
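The null-verbalizer control can be illustrated with a short sketch that swaps each verbalizer word for a randomly chosen first name while keeping the class mapping intact. The name list, the seed, and the function name `null_verbalizer` are assumptions made for this example; the text does not specify which first names were used or how they were sampled.

```python
import random

# Hypothetical pool of first names; the names actually used are not given in the text.
FIRST_NAMES = ["Alice", "Bruno", "Chloe", "David", "Emma", "Felix", "Grace", "Hugo"]


def null_verbalizer(verbalizer: dict, names=FIRST_NAMES, seed: int = 0) -> dict:
    """Replace each verbalizer word with a distinct random first name.

    The classes are unchanged; only the words the model must produce lose
    their semantic link to the task.
    """
    rng = random.Random(seed)
    chosen = rng.sample(names, k=len(verbalizer))  # distinct names, no repeats
    return dict(zip(chosen, verbalizer.values()))


if __name__ == "__main__":
    standard = {"yes": "entailment", "maybe": "neutral", "no": "contradiction"}
    print(null_verbalizer(standard))
```

Because the pattern is left untouched, comparing this control against the standard prompt separates what the pattern contributes from what the verbalizer words contribute.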

Impact of Different Prompts
If the prompt acts as a description of the task, one would expect different valid descriptions to vary in their benefits. In order to compare the different prompts we used on each task, we chart the median performance for each of them under different runs. In nearly every experiment, we find that the confidence intervals of those curves largely overlap, implying that prompt choice is not a dominant hyperparameter, i.e. the variance across random seeds usually outweighs the possible benefits of prompt choice. One exception is the low-data regime of BoolQ, where one of the prompts enjoys a significant few-shot advantage over the others. We plot this curve for MultiRC in Figure 3 and the rest in Appendix C.
Metric sensitivity We treat each metric linearly in calculating advantage; alternatively, we could reparameterize the y axis for each task. This choice does not have a consistent effect for or against prompting. For example, emphasizing gains close to convergence increases prompting advantage on CB and MNLI but decreases it on COPA or BoolQ.

Conclusion
We investigate prompting through a systematic study of its data advantage. Across tasks, prompting consistently yields a varying improvement throughout the training process. Analysis shows that prompting is mostly robust to pattern choice, and can even learn without an informative verbalizer. On large datasets, prompting is similarly helpful in terms of data points, although it is less beneficial in terms of performance. In future work, we hope to study the mechanism and training dynamics of the prompting benefits.
Significant compute resources were used to run this paper's experiments. A single experiment (defined as one model run, at one data level, on one task) was quite light-weight, taking usually a little under an hour on a single Nvidia V100. However, as we computed a little under two thousand runs, this adds up to about 1800 GPU hours, to which one must add around 400 GPU hours of prototyping and hyper-parameter searching. Those 2200 GPU hours would usually have necessitated the release of about 400kg of CO2, about 40% of a transatlantic flight for a single passenger, in the country where we ran the experiments, although we used a carbon-neutral cloud compute provider.
The main benefit of prompting, rather than compute efficiency, is data efficiency. Although we ran all of our experiments on English, we hope that this property will be especially helpful in low-resource language applications. In a sense, a practitioner could then remedy the lack of task-specific data in their language by introducing information through a prompt. However, this comes with the inherent risk of introducing human biases into the model. Prompt completion also suffers from biases already present within the language model (Sheng et al., 2019). This could cause a prompted model to repeat those biases in classification, especially in the few-shot setting where prompting mostly relies on the pretrained model. With the identity function as a verbalizer.

B Influence of the reporting method over runs
We chose to report the accumulated maximum performance across runs for every model. This means that if the maximum performance over random seeds is smaller than a maximum previously attained with fewer data points, we use the previous value. This appendix instead presents results that condense several runs using the maximum and the mean at every point. Using either maximum is equivalent; using the mean, however, can make results vary significantly, as the distribution of outcomes is heavily left-skewed, or even bimodal, with poor-performance outliers.
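The three reporting methods compared in this appendix can be sketched as follows, with runs represented as lists of scores indexed by data level. The function names and the toy numbers are made up for illustration.

```python
from itertools import accumulate
from statistics import mean


def pointwise_max(runs):
    """Best score over runs at each data level."""
    return [max(level) for level in zip(*runs)]


def accumulated_max(runs):
    """Best score over runs at each data level or any smaller one (running maximum)."""
    return list(accumulate(pointwise_max(runs), max))


def pointwise_mean(runs):
    """Mean score over runs at each data level; sensitive to poor-performance outliers."""
    return [mean(level) for level in zip(*runs)]


if __name__ == "__main__":
    # Two toy runs over three data levels (made-up numbers).
    runs = [[0.50, 0.70, 0.60], [0.55, 0.65, 0.68]]
    print(pointwise_max(runs))
    print(accumulated_max(runs))
    print(pointwise_mean(runs))
```

In this toy example the pointwise maximum dips at the third level, while the accumulated maximum carries the earlier best value forward.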

C Curves on all tasks
Figure 4: Prompting vs head (classifier) performance across data scales, up to the full dataset, for seven SuperGLUE tasks & MNLI. Compares the best prompt and head performance at each level of training data across 4 runs. Highlighted region shows the accuracy difference of the models. Cross-hatch region highlights the lowest- and highest-accuracy matched region in the curves. The highlighted area in this region is used to estimate the data advantage from prompting.