General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference

The state of the art on many NLP tasks is currently achieved by large pre-trained language models, which require a considerable amount of computation. We aim to reduce the inference cost in a setting where many different predictions are made on a single piece of text. In that case, computational cost during inference can be amortized over the different predictions (tasks) using a shared text encoder. We compare approaches for training such an encoder and show that encoders pre-trained over multiple tasks generalize well to unseen tasks. We also compare ways of extracting fixed- and limited-size representations from this encoder, including pooling features extracted from multiple layers or positions. Our best approach compares favorably to knowledge distillation, achieving higher accuracy and lower computational cost once the system is handling around 7 tasks. Further, we show that through binary quantization, we can reduce the size of the extracted representations by a factor of 16 to store them for later use. The resulting method offers a compelling solution for using large-scale pre-trained models at a fraction of the computational cost when multiple tasks are performed on the same text.


Introduction
Large pre-trained language models achieve state-of-the-art performance on many Natural Language Processing (NLP) tasks (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019). However, inference for these models requires significant computational resources, which limits their practical use. Recent trends show that scaling models up (Liu et al., 2019b; Lan et al., 2020; Raffel et al., 2019; Li et al., 2020) in terms of computation still improves end-task performance, raising questions about whether and how the most accurate models can be applied in real-world settings.
* Equal contribution.
This computational burden is exacerbated by the need to fine-tune end-to-end a separate model for each task. Since each model has a new set of parameters, none of the computation can be shared by models for different tasks during inference. This is particularly inefficient in real-world settings that require multiple predictions about each input. For example, given a news article, we may want to predict its topic (Zhang et al., 2015), sentiment (Pang and Lee, 2004;Maas et al., 2011;Socher et al., 2013;Zhang et al., 2015), overall text quality (Pitler and Nenkova, 2008), whether it is humorous (Yang et al., 2015) or offensive (Schmidt and Wiegand, 2017;Zampieri et al., 2019) and so on.
Knowledge Distillation (KD) is a way of reducing the computation required by large pre-trained LMs (Hinton et al., 2015;Sanh et al., 2019). However, there is a sizeable gap in accuracy between the best models using knowledge distillation and the full fine-tuned models. Another way of speeding up computation is through system optimizations such as quantization and operator fusion (Zafrir et al., 2019). These techniques can reduce the amount of computation significantly, but may not be sufficient by themselves and can be combined with the methods we discuss.
In this paper we look at new ways to make inference computationally efficient focusing on the case where different models (models for different tasks) are run over the same piece of text. We propose new methods to run multiple task-specific models in a way that amortizes the computation over the different tasks. The central idea is to compute the activations for the full model once and use smaller task-specific models on top of it. We explore three possible ways for sharing computation.
The first solution is inspired by work on general purpose text encoders (Kiros et al., 2015; Conneau et al., 2017; Subramanian et al., 2018), which produce fixed-size representations (i.e., sentence embeddings) that can be shared across tasks. We add only small task-specific layers on top of these fixed-size representations. Unfortunately, when evaluated on unseen tasks, we find that models that rely on fixed-size representations often underperform single-task baselines by a large margin, in agreement with past work (Subramanian et al., 2018; Peters et al., 2019; Raffel et al., 2019; Wang et al., 2019a).
Figure 1: An illustration of the finetuning approaches explored in this work. (a) In single-task finetuning, an encoder model is fine-tuned end-to-end for a given task. (b) In multi-task pre-training, an encoder model is jointly trained over k − 1 tasks, each with their own classification head. (c) In leave-one-task-out finetuning, a multi-task encoder is frozen and used to extract features for an unseen (k-th) task. Following Peters et al. (2019), separate symbols denote components that are fine-tuned for each task and components that are frozen.
The second solution is a multi-task system (Caruana, 1997;Collobert and Weston, 2008;Ruder, 2017), where a single model is jointly trained to handle many tasks (see Figure 1b). If most layers are shared, the overall inference cost can be nearly k times less than for k separate single-task models, while providing competitive task accuracy (Liu et al., 2019a;Raffel et al., 2019;Wang et al., 2019a). However, multi-task systems work best when the set of tasks is known in advance. Adding new tasks requires retraining the multi-task model and re-incurring training costs, thus limiting the utility of this approach in real-world systems where new classification tasks may be introduced periodically.
We propose a third solution: a multi-task encoder that is shared across tasks and produces limited-size representations that grow with the length of the input, similar to contextualized word representations (Peters et al., 2018). We evaluate our representations on 14 text classification tasks using a leave-one-task-out evaluation protocol (see Figure 1c), where a multi-task encoder model is trained on k − 1 tasks, frozen and used as a static feature extractor for an unseen k-th task. We find an important ingredient to performing well on an unseen (k-th) task is to extract features from multiple layers and positions of the encoder. Ultimately, our general purpose encoders offer a better tradeoff between task accuracy and inference cost than either fixed-size representations or distilled models, while requiring minimal additional computation to handle new tasks.
We also consider the case in which not all of the predictions can be made at the same time and intermediate representations have to be saved. In that context, we study the relationship between representation size and end-task performance. We find that features extracted by our encoders are amenable to heavy quantization, enabling a 16x reduction in the size of the extracted features with negligible impact on unseen task performance.

Related Work
Self-supervised pre-training, typically through language modeling, has advanced the state of the art for many NLP tasks (Peters et al., 2018;Radford et al., 2018;Devlin et al., 2019). There are two dominant ways of adapting pre-trained models to downstream tasks: (1) finetuning, which often results in the best accuracy (Devlin et al., 2019); and (2) feature extraction, which can be significantly more efficient during inference when there are multiple end tasks. Peters et al. (2019) compare these and find finetuning outperforms feature extraction for BERT; however, they use features immediately after pre-training, whereas we also consider features after multi-task finetuning.
Multi-task learning (MTL) has a rich history in machine learning (Caruana, 1997; Ruder, 2017) and NLP (Collobert and Weston, 2008; Luong et al., 2016). Multi-task models can potentially leverage similarities across tasks to achieve higher end-task accuracy than single-task models (Clark et al., 2019; Liu et al., 2019a; Phang et al., 2018; Wang et al., 2019a). Compared to single-task models, a multi-task model can also be more efficient during inference by sharing computation across tasks. Most work in multi-task learning assumes that the set of end tasks is fixed and known in advance and training is performed for all tasks together. This setup can present challenges in the real world where tasks may require different retraining schedules and new tasks may be frequently added or removed.
General purpose text encoders are usually pre-trained with a mix of supervised and self-supervised training objectives and produce fixed-size representations (Kiros et al., 2015; Hill et al., 2016; Conneau et al., 2017; Subramanian et al., 2018). Unlike multi-task learning, general purpose text encoders are typically evaluated on unseen tasks, which is more representative of real-world settings in which new tasks may be added periodically. Unfortunately, these approaches often underperform single-task baselines (McCann et al., 2018; Liu et al., 2019a; Wang et al., 2019a).
Another line of work has explored adapting pre-trained models by adding task-specific capacity at each layer (Houlsby et al., 2019). However, these methods do not improve inference efficiency, since there is no task-independent computation that can be shared across tasks.
Knowledge Distillation (Buciluǎ et al., 2006; Hinton et al., 2015) is a technique where a more efficient student model is trained to mimic the behaviour of a larger or ensembled teacher model. A knowledge distilled version of BERT (Sanh et al., 2019) has been proposed to reduce the computation required by large pre-trained language models. DistilRoBERTa reaches 95% of RoBERTa-base's performance on GLUE while being twice as fast.
Quantization and other compression techniques have been explored for word embeddings (Shu and Nakayama, 2017;Tissier et al., 2019) and sentence embeddings (Shen et al., 2019). Recent work has also explored quantization for contextualized word representations, generally showing that quantization-aware training is necessary to achieve reasonable end task performance (Zafrir

Experimental Setup
Our goal is to develop text encoders that produce representations which achieve high accuracy for multiple tasks with little task-specific processing. We first introduce our tasks, encoder models and finetuning framework.

Tasks
We consider 14 text classification tasks, spanning sentiment analysis (SA), natural language inference (NLI), paraphrase identification (PP), document categorization (DOC) and linguistic acceptability (LA). Tasks are chosen for their diversity and usage in recent related work, ensuring that our baselines are representative of the state of the art. Details about each task are given in Table 1. The SA, DOC and LA tasks consist of making predictions about a single text input, while NLI and PP tasks require classifying a pair of text inputs. For pair tasks we concatenate the text with a special separator token following Liu et al. (2019b). Since many of our tasks are part of evaluation benchmarks such as GLUE (Wang et al., 2019b) and the test sets are not publicly available, we report accuracy on the corresponding development sets.

Encoder models
Our encoder models are based on RoBERTa (Liu et al., 2019b), an optimized version of BERT (Devlin et al., 2019) that achieves competitive performance on most of the tasks considered in this work. We primarily use the public RoBERTa LARGE model consisting of 24 Transformer layers (Vaswani et al., 2017), 1024 dimensional representations and 355M parameters. We refer the reader to Devlin et al. (2019) for more details about the BERT architecture and Liu et al. (2019b) for more details about RoBERTa.
We also consider a Knowledge Distilled (KD) version of RoBERTa called DistilRoBERTa (Sanh et al., 2019), which consists of 6 Transformer layers, 768-dim representations and 82M parameters. The distilled model contains 1/4 as many parameters and requires 1/7 as much computation (FLOPs) as the full model. We present a more detailed comparison of the computational requirements for these encoder models in Section 6.5.

Finetuning
We consider two methods for finetuning encoder models, illustrated in Figure 1. Finetuning hyperparameters and other methodological details are given in the Appendix.

Single-task finetuning
Single-task finetuning is the most common way of adapting pre-trained language models to a given task (see Figure 1a). When applied to large pre-trained models (e.g., RoBERTa) it often results in the best end-task accuracy, but requires the full model to be run for every task and thus has the highest inference costs for a set of k tasks. Computation can be reduced by using smaller pre-trained models, including knowledge distilled models (e.g., DistilRoBERTa).
Single-task finetuning serves as an upper bound. Our goal is to achieve similar accuracy as large single-task models with reduced inference costs.

Leave-one-task-out finetuning
We also consider leave-one-task-out finetuning, illustrated in Figures 1b and 1c. We pre-train a multi-task encoder on k − 1 tasks and extract frozen features for a k-th task. Freezing the encoder allows us to amortize the inference cost over tasks. The leave-one-task-out setup allows us to evaluate generalization on tasks unseen in the training of the encoder. This replicates the real-world setting of adding new tasks to an existing frozen encoder. Leave-one-task-out finetuning has two stages:
1. Multi-task pre-training: We train a single model end-to-end over k − 1 tasks (Figure 1b). The majority of the encoder weights are shared across tasks, except for a classification head (see Section 3.4) that is unique to each task.
It is important for the multi-task model to properly weight different tasks, so that larger tasks do not dominate smaller ones (Raffel et al., 2019; Wang et al., 2019a). We adopt a loss-reweighting technique inspired by Raffel et al. (2019). At each step, we sample a batch of data for every task and update our model according to a weighted sum of the task losses, where D_i is the number of training examples for task i and T is a temperature controlling weight uniformity. When T = 1, task weights are proportional to data size, and as T → 0, task weights become uniform. We use a fixed temperature of T = 0.1, which performed best in early experiments.
2. Leave-one-task-out finetuning: In the second stage, we freeze the multi-task encoder's weights and use it as a feature extractor for an unseen k-th task (see Figure 1c). The extracted features are fed to a new, randomly initialized classification head, which is fine-tuned over the training data for the k-th task. We repeat this process k times, with each task held out once, and report the corresponding held-out task performance.
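The loss formula for stage 1 is not reproduced in this text. As a rough sketch only, one weighting that matches the stated limits (proportional to data size at T = 1, uniform as T → 0) is w_i = D_i^T / Σ_j D_j^T. Both this form and the dataset sizes below are assumptions made for illustration, not values taken from the paper.

```python
def task_weights(sizes, temperature):
    """Assumed form: w_i = D_i**T / sum_j D_j**T (an illustration, not the paper's equation)."""
    scaled = [d ** temperature for d in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]


def weighted_loss(task_losses, weights):
    """Weighted sum of per-task losses used for a single multi-task update."""
    return sum(w * loss for w, loss in zip(weights, task_losses))


# Hypothetical training-set sizes for three tasks.
sizes = [400_000, 60_000, 8_000]
proportional = task_weights(sizes, temperature=1.0)  # weights follow data size
near_uniform = task_weights(sizes, temperature=0.1)  # the setting T = 0.1
uniform = task_weights(sizes, temperature=0.0)       # limiting case T -> 0

print(proportional, near_uniform, uniform)
```

Under this assumed form, T = 0.1 sits close to the uniform end, so small tasks contribute almost as much to each update as large ones.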

Classification heads
Each task has a classification head that takes features as input and makes a prediction. While related work uses task-specific classification layers (Peters et al., 2018, 2019; Liu et al., 2019a), we adopt a unified architecture for all tasks. We follow the original BERT setup (Devlin et al., 2019) and use a two-layer Multi-Layer Perceptron (MLP) with inner dimension equal to the pooled feature dimension and a tanh activation function. The classification head is always fine-tuned for the end task.
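A minimal NumPy sketch of the forward pass of such a head is given here for concreteness. The class name, the initialization scale (0.02) and the absence of dropout are assumptions; training code is omitted.

```python
import numpy as np


class ClassificationHead:
    """Two-layer MLP: Linear(d, d) -> tanh -> Linear(d, num_classes)."""

    def __init__(self, feature_dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        # The inner dimension equals the pooled feature dimension.
        self.w1 = rng.normal(0.0, 0.02, size=(feature_dim, feature_dim))
        self.b1 = np.zeros(feature_dim)
        self.w2 = rng.normal(0.0, 0.02, size=(feature_dim, num_classes))
        self.b2 = np.zeros(num_classes)

    def __call__(self, features):
        """features: (batch, feature_dim) pooled features; returns (batch, num_classes) logits."""
        hidden = np.tanh(features @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2


head = ClassificationHead(feature_dim=1024, num_classes=3)
logits = head(np.zeros((2, 1024)))
print(logits.shape)
```

Because the head is small relative to the encoder, adding a task in the frozen-encoder setting adds only this computation per input.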

Feature extraction and pooling
A common way to extract features from BERT-like models is to take the representation in the last Transformer layer corresponding to a special CLS token prepended to the input sequence (Devlin et al., 2019). Recent work has also explored extracting features from every position and layer, then linearly combining the layers with task-specific weights (Peters et al., 2019; Tenney et al., 2019).
We propose a more general framework for extracting features, shown in Figure 2. We extract features from several layers of the encoder and then pool them, first across layers and then across positions, before feeding them to a task-specific classification head. This framework subsumes both the CLS token and weighted layer combination approaches. We consider several ways of layer-wise pooling and position-wise pooling:
Layer-wise pooling approaches:
• LAST-LAYER: only use the last layer. This setting is used by Devlin et al. (2019).
• LAYER-AVG: average the last m layers. We tune m for each setting, but find that m = 16 works best in most cases.
Position-wise pooling approaches:
• CLS: extract features from the first position. This setting is used by Devlin et al. (2019).
• POSITION-AVG: average features across positions.
• MHA: pool features with a task-specific Multi-Head Attention (MHA) layer (Devlin et al., 2019). We learn a task-specific query and use features as the keys and values (see Figure 3).
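The pooling options listed above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the tensor sizes are toy values, and the MHA variant omits the learned key/value/output projections that a full attention layer would normally include.

```python
import numpy as np


def layer_avg(hidden, m):
    """hidden: (num_layers, seq_len, dim). Average the last m layers -> (seq_len, dim)."""
    return hidden[-m:].mean(axis=0)


def cls_pool(features):
    """Take the first position (the CLS token) -> (dim,)."""
    return features[0]


def position_avg(features):
    """Average over positions -> (dim,)."""
    return features.mean(axis=0)


def mha_pool(features, query, num_heads):
    """Attention pooling with one task-specific query; features act as keys and values."""
    seq_len, dim = features.shape
    head_dim = dim // num_heads
    keys = features.reshape(seq_len, num_heads, head_dim)
    q = query.reshape(num_heads, head_dim)
    scores = np.einsum("shd,hd->hs", keys, q) / np.sqrt(head_dim)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability for softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return np.einsum("hs,shd->hd", attn, keys).reshape(dim)


rng = np.random.default_rng(0)
hidden = rng.normal(size=(24, 50, 64))  # toy stand-in: 24 layers, 50 tokens, 64 dims
features = layer_avg(hidden, m=16)      # LAYER-AVG over the last 16 layers
pooled_cls = cls_pool(features)
pooled_avg = position_avg(features)
pooled_mha = mha_pool(features, query=rng.normal(size=64), num_heads=4)
```

With an all-zero query the attention weights are uniform, so this MHA pooling reduces to POSITION-AVG. The learned query is what lets each task emphasize different positions.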

Storage Considerations and Quantization
In real-world settings it may be necessary to store extracted features for later use, such as when new tasks are introduced that require "backfilling" classifications for older content. Storage costs quickly become impractical when pooling over multiple hidden layers and positions, with some approaches (Section 4) requiring features from every layer and position in the encoder. For RoBERTa LARGE, with 24 layers and 1024-dimensional representations, a 50-token input would thus emit 50 * 24 * 1024 half-precision floating point numbers and require 2.3MB of storage.
We consider quantization methods, described below, for reducing the storage of extracted features. With quantization, we replace floating point numbers with alternative representation formats that have reduced bit width. We will show in Section 6 that extracted features are surprisingly robust: they show little degradation in end-task accuracy even with binary quantization. Recent work has made similar observations in the context of 8-bit integer quantization for BERT model weights and activations (Zafrir et al., 2019).
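The storage figure above can be checked with a few lines of arithmetic. The helper name is hypothetical, and the 2.3MB value is interpreted here in binary megabytes (2^20 bytes), which is an assumption about the unit.

```python
def feature_bytes(tokens, layers, dim, bits_per_value):
    """Bytes needed to store features of shape (layers, tokens, dim)."""
    return tokens * layers * dim * bits_per_value // 8


# All 24 layers, 50 tokens, 1024 dims, half precision (16 bits per value).
all_layers_fp16 = feature_bytes(50, 24, 1024, 16)
print(all_layers_fp16, all_layers_fp16 / 2**20)

# After layer-wise pooling only one 1024-dim vector per token remains.
pooled_fp16 = feature_bytes(50, 1, 1024, 16)
pooled_uint8 = feature_bytes(50, 1, 1024, 8)
pooled_1bit = feature_bytes(50, 1, 1024, 1)
print(pooled_fp16, pooled_uint8, pooled_1bit)
```

Going from 16 bits to 1 bit per value is where the factor of 16 comes from, independent of sequence length.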
We explore both 8-bit (uint8) and 1-bit (boolean) quantization of extracted features (see Table 2). We apply quantization prior to leave-one-task-out finetuning (Section 3.3.2) to simulate a real-world setting in which only quantized features are available. For 8-bit quantization, we use PyTorch (Paszke et al., 2019) to learn scale and zero-point parameters which map floating point numbers to the range 0-255. For 1-bit quantization, we apply the sign function to binarize each feature dimension.

Results
Table 3 presents our main results for the 14 tasks introduced in Section 3.1. Detailed results of all tasks are included in Table 5 in the Appendix.

Baselines
Table 3 (a) shows results for models fine-tuned end-to-end on a single task. This approach yields the best end-task accuracy but has the highest inference costs (see discussion in Section 3.3.1).
We observe that DistilRoBERTa achieves competitive accuracy across many tasks with only 1/4 as many parameters and 1/7 of the computation of the full RoBERTa model. Multi-task pretraining (see Section 3.3.2) prior to single-task finetuning improves results with an average gain of +0.2%. This is consistent with recent work (Liu et al., 2019a;Wang et al., 2019a), but somewhat at odds with the findings of Raffel et al. (2019), who report slightly worse performance with multi-task pre-training. It remains an open question under what conditions multi-task pre-training improves end task accuracy for single-task models.

Feature extraction and pooling

Without multi-task pre-training
Table 3 (b) shows results for single-task models with a frozen encoder and fine-tuned classification head. We observe that freezing the pre-trained RoBERTa model and extracting features from the last layer's CLS token performs poorly, with a 15-point drop in accuracy compared to the end-to-end finetuned version (90.5% → 75.5%). This is expected, since the CLS token is not heavily used in the RoBERTa pre-training process (Liu et al., 2019b). If we instead average features across positions in the last layer, we see slightly higher accuracy compared to the CLS token alone (77.7% vs. 75.5%), while our multi-head attention (MHA) pooling further improves accuracy to 83.3%, confirming the importance of task-specific position-wise pooling.
We next consider different layer-wise pooling strategies, still using the MHA position-wise pooling. Taking a simple average over the top 16 layers improves accuracy by +2.2% compared to using just the last layer (85.5% vs. 83.3%). If we instead learn a task-specific weighted combination of layers, similar to Peters et al. (2019), we gain an additional +0.1% compared to using a simple average. However, using a task-specific combination of layers introduces significant storage costs (see Table 2), thus we focus on the LAYER-AVG pooling approach in the rest of our experiments.

With multi-task pre-training
Table 3 (c) presents results for leave-one-task-out multi-task pre-training (Section 3.3.2), in which the encoder is fine-tuned on k − 1 tasks, then frozen. In this setting, the last layer's CLS token now encodes general task information, achieving a higher average accuracy than any of the frozen encoders which did not have leave-one-task-out multi-task pre-training (85.9% vs. 85.6%). As before, our multi-head attention (MHA) position-wise pooling strategy performs best, outperforming the CLS approach by +0.9% and the POSITION-AVG strategy by +0.8%. Layer-wise pooling across multiple layers provides an additional 1.6-1.7% gain.

Quantization
Table 3 (c) also shows the effect of feature quantization on task accuracy. We quantize extracted features after leave-one-task-out multi-task pre-training and use LAYER-AVG / MHA pooling, which offers the best balance between storage efficiency and accuracy. In early experiments, we considered whether to quantize before or after layer-wise pooling and found that quantization before layer-wise pooling was slightly better for 1-bit quantization and had no impact on 8-bit quantization.
We observe no performance loss with 8-bit quantization. Surprisingly, 1-bit quantization only reduces accuracy by 0.4%, still outperforming distillation-based methods (88.0% vs. 87.1%), while reducing storage costs by a factor of 16 (to 1024 bits per token; see Table 2).
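To make the two schemes concrete, here is a rough NumPy sketch of 8-bit affine quantization and 1-bit sign quantization. It is an approximation under stated assumptions: the paper learns scale and zero-point with PyTorch, whereas this sketch calibrates them from the minimum and maximum of the data, and all function names are hypothetical.

```python
import numpy as np


def quantize_uint8(features):
    """Affine map to 0..255. Scale and zero-point come from min/max here (an assumption)."""
    lo = min(float(features.min()), 0.0)  # keep zero inside the representable range
    hi = max(float(features.max()), 0.0)
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(features / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point


def dequantize_uint8(q, scale, zero_point):
    """Approximate reconstruction of the original floating point values."""
    return (q.astype(np.float32) - zero_point) * scale


def binarize(features):
    """1-bit quantization: keep only the sign, packed 8 values per byte."""
    return np.packbits(features > 0, axis=-1)


def unbinarize(packed, dim):
    """Unpack to +1 / -1 values (zeros map to -1 in this sketch)."""
    bits = np.unpackbits(packed, axis=-1)[..., :dim]
    return bits.astype(np.float32) * 2.0 - 1.0


rng = np.random.default_rng(0)
features = rng.normal(size=(50, 1024)).astype(np.float32)  # toy pooled features, one row per token
q8, scale, zero_point = quantize_uint8(features)
packed = binarize(features)
print(features.astype(np.float16).nbytes, q8.nbytes, packed.nbytes)
```

Each 1024-dimensional token vector packs into 128 bytes, i.e. 1024 bits per token, which is 16 times smaller than the half-precision version.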
To understand why quantization works so well, we use a word-sense disambiguation (WSD) task to probe if semantic information encoded in the original and quantized features is preserved. Following Peters et al. (2018), we apply a nearest neighbor classifier over word sense centroids, obtained by averaging features for each word sense over training data. We use the data and splits from Reif et al. (2019). We extract features from the 16th layer of the multi-task RoBERTa encoder (Table 3 (e)), which performed best in pilot experiments, and compare them before and after 1-bit quantization.
The F1 scores in Table 4 show that RoBERTa features achieve similar results before and after 1-bit quantization. Both results are comparable to those from Reif et al. (2019), confirming that 1-bit quantization at least preserves word sense information in the extracted features.
Table 3: Results on 14 tasks, grouped by task type (see Section 3.1). We consider different layer-wise and position-wise pooling strategies introduced in Section 4. We also report the estimated inference cost for 14 tasks (in GFLOPs) for each strategy. Bold results indicate the most accurate method in each section. BERT results are from  and Sun et al. (2019). XLNet results are from . DistilRoBERTa and RoBERTa results are recomputed ourselves. Full results for each task are given in the Appendix. (*) We recomputed accuracy for CoLA, since BERT and XLNet originally reported a different metric.
Table 4: WSD F1 scores before and after 1-bit quantization.
Model | Original | 1-bit
baseline (most freq. sense) | 64.8 | -
BERT (Reif et al., 2019) | 71.1 | -
RoBERTa | 71.2 | 71.1
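The probing procedure described above (a nearest neighbor classifier over word sense centroids) can be sketched as follows. The cosine similarity measure, the function names and the three-dimensional toy vectors are assumptions made for illustration; the source does not specify the distance used.

```python
import numpy as np


def sense_centroids(features, senses):
    """Average the training features belonging to each word sense."""
    labels = sorted(set(senses))
    senses = np.asarray(senses)
    centroids = np.stack([features[senses == s].mean(axis=0) for s in labels])
    return labels, centroids


def predict_sense(query, labels, centroids):
    """Return the sense whose centroid is closest by cosine similarity (assumed measure)."""
    q = query / np.linalg.norm(query)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return labels[int(np.argmax(c @ q))]


# Toy features for four training occurrences of one ambiguous word.
train = np.array([[1.0, 0.2, -0.5],
                  [0.8, 0.1, -0.7],
                  [-0.9, 0.4, 0.6],
                  [-1.0, 0.3, 0.8]])
senses = ["bank/finance", "bank/finance", "bank/river", "bank/river"]
query = np.array([0.7, 0.3, -0.4])

labels, centroids = sense_centroids(train, senses)
full_precision = predict_sense(query, labels, centroids)

# Same probe on sign-binarized (1-bit) features.
labels_b, centroids_b = sense_centroids(np.sign(train), senses)
one_bit = predict_sense(np.sign(query), labels_b, centroids_b)
print(full_precision, one_bit)
```

If the sign pattern of a feature vector already separates the senses, as in this toy case, the probe gives the same answer before and after binarization.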

Generalization
We used the leave-one-task-out setting to evaluate generalization to unseen tasks. We now consider the case where an entire task type (see Section 3.1) is held out during multi-task pre-training. For example, we pre-train an encoder over non-NLI tasks and evaluate the frozen features on NLI tasks. Results presented in the fourth section (d) of Table 3 show that performance drops considerably from the corresponding leave-one-task-out setting (  (Phang et al., 2018). Thus, it is important to pre-train the encoder over a variety of task types to maximize generalization to new tasks.
Another alternative to leaving tasks out is to pre-train the encoder over all k tasks and evaluate it on each task without additional finetuning. This setting is useful when the set of all tasks is known in advance and does not change. Results (in the final section of Table 3 (e)) show that when models are part of the multi-task finetuning they perform 3.4% better on average as opposed to when they are held out (89.3% vs. 85.9%).

Computational cost during inference
Table 3 reports cumulative inference cost (over 14 tasks) for each method. Single-task finetuning is the most accurate and the least efficient approach. Approaches using knowledge distillation and frozen encoders reduce FLOPs by an order of magnitude. Figure 4 shows the number of FLOPs required for inference as a function of the number of tasks performed on the same text. While single-task finetuning of the full model is never efficient, distilled models are the most efficient for systems with 7 or fewer tasks. Frozen encoder approaches become the most efficient option when more than 7 tasks are performed on the same piece of text.
Figure 4: The cost for single-task models grows linearly with the number of tasks, whereas approaches based on a frozen encoder are much more efficient. Distilled models are particularly efficient when the number of tasks is small, but the cost scales linearly and becomes less efficient than a frozen encoder when the number of tasks T > 7.
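The break-even point around 7 tasks follows from the stated cost ratio: the distilled model needs 1/7 of the full model's computation per task, while a frozen encoder runs the full model once. The sketch below uses normalized units, and the per-task head cost is a hypothetical small value, not a figure from the paper.

```python
FULL = 1.0            # one forward pass of the full encoder (normalized unit)
DISTILLED = FULL / 7  # stated ratio: the distilled model needs 1/7 of the computation
HEAD = 0.001          # hypothetical cost of one task-specific head (assumption)


def single_task_cost(k):
    """k separately fine-tuned full models."""
    return k * FULL


def distilled_cost(k):
    """k separately fine-tuned distilled models."""
    return k * DISTILLED


def frozen_encoder_cost(k):
    """One shared encoder pass plus k small task heads."""
    return FULL + k * HEAD


def cheapest(k):
    """Name of the cheapest approach for k tasks on the same text."""
    costs = {
        "single-task": single_task_cost(k),
        "distilled": distilled_cost(k),
        "frozen encoder": frozen_encoder_cost(k),
    }
    return min(costs, key=costs.get)


for k in (1, 7, 8, 14):
    print(k, cheapest(k))
```

The crossover depends only weakly on the head cost as long as the head is much cheaper than the distilled model, which is why the shared encoder wins once slightly more than 7 tasks share the same input.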

Conclusion
We study how to improve the efficiency of large-scale pre-trained models so that they can be used in practical settings. We show that when several tasks are performed on a single piece of text, the computation can be effectively amortized, reducing the amount of computation per task. Compared to distillation, the shared computation method achieves higher accuracy and reduces computational cost once more than 7 tasks need to be performed on the same piece of text. We show that the shared features can be quantized with very little loss in accuracy, which means that the intermediate computation can be stored for later use. In total, the techniques that we present provide a compelling solution for running large-scale pre-trained models in applications where multiple predictions are made on the same piece of text.

A Finetuning Methodology
We largely adopt the finetuning procedure and hyperparameters from Liu et al. (2019b). We use the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.98, ε = 1e-6. We search over learning rates ∈ {1,2,3}e-5 and batch sizes ∈ {16,32} for each task. We finetune for 10 epochs and report the best dev set accuracy for each task, which we measure at epoch boundaries. We linearly warm up the learning rate for the first 6% of finetuning updates and linearly decay the rate to 0 for the remaining updates. We apply dropout with p = 0.1 and finetune with weight decay of 0.01.
For finetuning the multi-task encoders, we use a learning rate of 1e-5 and batches consisting of 4 samples from each task. For example, a leave-one-task-out encoder for MNLI would be finetuned on batches containing 4 samples from each of the 13 non-MNLI tasks, for a total batch size of 52. As described in Section 3.3.2, the task losses are weighted with a mixing temperature T = 0.1 and summed, following Raffel et al. (2019).
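The learning rate schedule and the multi-task batch size described above can be written out as a short sketch. The function name and the total step count are hypothetical; only the 6% warmup, the linear decay to 0, the peak rate of 1e-5 and the 4 samples per task come from the text.

```python
def learning_rate(step, total_steps, peak_lr, warmup_fraction=0.06):
    """Linear warmup to peak_lr over the first 6% of updates, then linear decay to 0."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Hypothetical run of 1000 updates with the multi-task peak rate of 1e-5.
schedule = [learning_rate(s, total_steps=1000, peak_lr=1e-5) for s in range(1001)]

# Leave-one-task-out batch: 4 samples from each of the 13 remaining tasks.
samples_per_task = 4
num_training_tasks = 14 - 1
batch_size = samples_per_task * num_training_tasks
print(max(schedule), schedule[-1], batch_size)
```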

B Detailed Results
The results for all setups over 14 tasks can be found in Table 5. See Table 3 for more details.