Masking as an Efficient Alternative to Finetuning for Pretrained Language Models

We present an efficient method of utilizing pretrained language models, where we learn selective binary masks for pretrained weights in lieu of modifying them through finetuning. Extensive evaluations of masking BERT and RoBERTa on a series of NLP tasks show that our masking scheme yields performance comparable to finetuning, yet has a much smaller memory footprint when inference must be performed for several tasks simultaneously. Through intrinsic evaluations, we show that representations computed by masked language models encode the information necessary for solving downstream tasks. Analyzing the loss landscape, we show that masking and finetuning produce models that reside in minima that can be connected by a line segment with nearly constant test accuracy. This confirms that masking can be utilized as an efficient alternative to finetuning.


Introduction
Finetuning a large pretrained language model such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019c), or XLNet (Yang et al., 2019) often yields competitive or even state-of-the-art results on NLP benchmarks (Wang et al., 2018, 2019). Given an NLP task, finetuning stacks a linear layer on top of the pretrained language model and then updates all parameters using SGD. Various aspects of this two-stage transfer learning NLP paradigm (Dai and Le, 2015; Howard and Ruder, 2018), such as brittleness (Dodge et al., 2020) and adaptiveness (Peters et al., 2019), have been studied.
Despite the simplicity and impressive performance of finetuning, the prohibitively large number of parameters to be finetuned, e.g., 340 million in BERT-large, is a major obstacle to wider deployment of these models. The large memory footprint of finetuned models becomes more prominent when multiple tasks need to be solved simultaneously: several copies of the millions of finetuned parameters have to be saved for inference.
Combining finetuning with multi-task learning (Collobert and Weston, 2008; Ruder, 2017) helps reduce the overall number of required parameters. But multi-task NLP models may produce inferior results compared with their single-task counterparts (Martínez Alonso and Plank, 2017; Bingel and Søgaard, 2017). Solving this problem is non-trivial, and more complicated techniques, e.g., knowledge distillation (Hinton et al., 2015; Clark et al., 2019), adding extra modules (Stickland and Murray, 2019), or designing sophisticated task-specific layers (Liu et al., 2019b), may be necessary. In this work, we present a method that utilizes pretrained language models efficiently while eliminating potential interference among tasks.
Recent work (Gaier and Ha, 2019; Zhou et al., 2019) points out the potential of searching neural architectures within a fixed model, as an alternative to directly optimizing the model weights for downstream tasks. Inspired by these results, we present a simple yet efficient scheme for utilizing pretrained language models. Instead of directly updating the pretrained parameters, we propose to select weights important to downstream NLP tasks while discarding irrelevant ones. The selection mechanism consists of a set of binary masks, one learned per downstream task through end-to-end training.
We show that masking, when applied to pretrained language models like BERT and RoBERTa, achieves performance comparable to finetuning in tasks like sequence classification, part-of-speech tagging, and reading comprehension. This is surprising in that a simple subselection mechanism that does not change any weights is competitive with finetuning, a training regime that can change the value of every single weight.
We conduct detailed analyses revealing factors important for and possible reasons contributing to the good task performance.
Masking is parameter-efficient: only a set of 1-bit binary masks needs to be saved per task after training, instead of all 32-bit float parameters as in finetuning. This small memory footprint enables deploying pretrained language models for solving multiple tasks on edge devices. The compactness of masking also naturally allows parameter-efficient ensembles of pretrained language models.
Our contributions: (i) We introduce masking, a new scheme for utilizing pretrained language models: learning selective masks of pretrained weights. Masking is an efficient alternative to finetuning. We show that masking is applicable to models like BERT and RoBERTa, and produces performance on par with finetuning. (ii) We carry out extensive empirical analysis of masking, shedding light on factors critical for achieving good performance on a series of NLP tasks. (iii) We study the loss landscape and compute representations of masked language models, revealing potential reasons why masking has task performance comparable to finetuning.

Related Work
Two-stage NLP paradigm. Pretrained language models (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019c; Yang et al., 2019; Radford et al.) advance NLP with contextualized representations of words. Finetuning a pretrained language model (Dai and Le, 2015; Howard and Ruder, 2018) often delivers competitive performance, partly because pretraining leads to a better initialization across various downstream tasks than training from scratch (Hao et al., 2019). However, finetuning on individual NLP tasks is not parameter-efficient: each finetuned model, typically consisting of hundreds of millions of floating point parameters, needs to be saved individually. Stickland and Murray (2019) use projected attention layers with multi-task learning to improve the efficiency of finetuning BERT. Houlsby et al. (2019) insert adapter modules into BERT for parameter-efficient transfer learning. Since the newly inserted modules alter the forward pass of BERT, they need to be carefully initialized to be close to the identity.
In this work, we propose to directly pick parameters appropriate to a downstream task, by learning selective binary masks via end-to-end training.
With the pretrained parameters being untouched, we solve several downstream NLP tasks simultaneously with minimal overhead.
Binary networks and network pruning. The rationale behind training binary masks is the "straight-through estimator" (Bengio et al., 2013; Hinton, 2012). Hubara et al. (2016), Rastegari et al. (2016), and Hubara et al. (2017), inter alia, apply this technique to train efficient binarized neural networks. We use this estimator to train selective masks for pretrained language model parameters. Investigating the lottery ticket hypothesis (Frankle and Carbin, 2018) of network pruning (Han et al., 2015a; He et al., 2018; Liu et al., 2019d; Lee et al., 2019; Lin et al., 2020), Zhou et al. (2019) find that applying binary masks to a neural network is a form of training the network. Gaier and Ha (2019) propose to search neural architectures for reinforcement learning and image classification tasks without any explicit weight training. This work inspires our masking scheme (which can be interpreted as implicit neural architecture search (Liu et al., 2019d)): applying the masks to a pretrained language model is similar to finetuning, yet much more parameter-efficient.
In perhaps the closest work to ours, Mallya et al. (2018) apply binary masks to CNNs and achieve good performance in computer vision. We learn selective binary masks for pretrained language models in NLP and shed light on factors important for obtaining good performance. Mallya et al. (2018) explicitly update weights in a task-specific classifier layer. In contrast, we show that end-to-end learning of selective masks, for both the pretrained language model and a randomly initialized classifier layer, achieves good performance.

Transformer
The encoder of the Transformer architecture (Vaswani et al., 2017) is ubiquitously used for pretraining large language models. We briefly review its architecture and then present our masking scheme. Taking BERT-base as an example, each of the 12 transformer blocks consists of (i) four linear layers $W_K$, $W_Q$, $W_V$, and $W_{AO}$ for computing and outputting the self-attention among input wordpieces (Wu et al., 2016), and (ii) two linear layers $W_I$ and $W_O$ feeding the word representations forward to the next transformer block.
When finetuning on a downstream task like sequence classification, a linear classifier layer $W_T$, projecting from the hidden dimension to the output dimension, is randomly initialized. Next, $W_T$ is stacked on top of a pretrained linear layer $W_P$ (the pooler layer). All parameters are then updated to minimize the task loss such as cross-entropy.
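To make this setup concrete, the following is a minimal PyTorch sketch of the classification head; the tanh pooler activation follows common BERT implementations, and all names are illustrative rather than the paper's actual code.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Randomly initialized classifier W_T stacked on the pretrained pooler W_P."""
    def __init__(self, pooler: nn.Linear, hidden_size: int, num_labels: int):
        super().__init__()
        self.pooler = pooler                                  # pretrained W_P
        self.classifier = nn.Linear(hidden_size, num_labels)  # randomly initialized W_T

    def forward(self, cls_hidden: torch.Tensor) -> torch.Tensor:
        # cls_hidden: final-layer hidden state of the [CLS] token, shape [batch, hidden]
        return self.classifier(torch.tanh(self.pooler(cls_hidden)))
```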

Learning the mask
Given a pretrained language model, we do not finetune, i.e., we do not update the pretrained parameters with SGD-based optimization. Instead, we carefully select a subset of the pretrained parameters that is critical to a downstream task while discarding irrelevant ones. We associate each linear layer $W^l \in \{W^l_K, W^l_Q, W^l_V, W^l_{AO}, W^l_I, W^l_O\}$ of the $l$-th transformer block with a real-valued matrix $M^l$ that is randomly initialized from a uniform distribution and has the same size as $W^l$. We then pass $M^l$ through an element-wise thresholding function (Hubara et al., 2016; Mallya et al., 2018), i.e., a binarizer, to obtain a binary mask $M^l_{\text{bin}}$ for $W^l$:

$$m^l_{\text{bin},(i,j)} = \begin{cases} 1 & \text{if } m^l_{i,j} \geq \tau \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where $m^l_{i,j} \in M^l$, $(i, j)$ indexes the coordinates of the 2-D linear layer, and $\tau$ is a global thresholding hyperparameter.
In each forward pass of training, the binary mask $M^l_{\text{bin}}$ (derived from $M^l$ via Eq. 1) selects weights in a pretrained linear layer $W^l$ by Hadamard product:

$$\hat{W}^l = W^l \odot M^l_{\text{bin}}$$

In the corresponding backward pass of training, with the associated loss function $\mathcal{L}$, we cannot backpropagate through the binarizer, since Eq. 1 is a hard thresholding operation and the gradient with respect to $M^l$ is zero almost everywhere. Similar to the treatment in Bengio et al. (2013), we use the straight-through estimator and update $M^l$ with the gradient computed for $M^l_{\text{bin}}$:

$$M^l \leftarrow M^l - \eta \frac{\partial \mathcal{L}}{\partial M^l_{\text{bin}}} \qquad (2)$$

where $\eta$ refers to the step size. Hence, the whole structure can be trained end-to-end. We learn a set of binary masks for an NLP task as follows. Recall that each linear layer $W^l$ is associated with an $M^l$, from which we obtain a masked linear layer $\hat{W}^l$ through Eq. 1. We randomly initialize an additional linear layer with an associated $M^l$ and stack it on top of the pretrained language model. We then update each $M^l$ through Eq. 2 with the task objective during training.
After training, we pass each $M^l$ through the binarizer to obtain the final $M^l_{\text{bin}}$, which is then saved for future inference. Since $M^l_{\text{bin}}$ is binary, it takes only about 3% (1/32) of the memory needed to save the 32-bit float parameters of a model computed by finetuning. Additionally, we will show that many layers, in particular the embedding layer, do not have to be masked. This further reduces the memory consumption of masking.
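A minimal PyTorch sketch of the scheme described above, assuming a global threshold tau and uniform mask initialization: `Binarizer` implements Eq. 1 in the forward pass and the straight-through estimator of Eq. 2 in the backward pass, while `MaskedLinear` applies the Hadamard product to a frozen pretrained layer. This is an illustration, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Binarizer(torch.autograd.Function):
    """Eq. 1 forward (hard thresholding); straight-through estimator backward (Eq. 2)."""
    @staticmethod
    def forward(ctx, m: torch.Tensor, tau: float) -> torch.Tensor:
        return (m >= tau).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Pass the gradient w.r.t. M_bin straight through to the real-valued mask M.
        return grad_output, None

class MaskedLinear(nn.Module):
    """A frozen pretrained linear layer W^l selected by a learnable binary mask."""
    def __init__(self, pretrained: nn.Linear, tau: float = 0.01,
                 init_sparsity: float = 0.05, width: float = 0.02):
        super().__init__()
        self.weight = pretrained.weight.detach()  # pretrained weights stay untouched
        self.bias = pretrained.bias.detach() if pretrained.bias is not None else None
        self.tau = tau
        # U(lo, hi) chosen so that a fraction `init_sparsity` of entries starts below tau.
        lo, hi = tau - init_sparsity * width, tau + (1.0 - init_sparsity) * width
        self.mask = nn.Parameter(torch.empty_like(self.weight).uniform_(lo, hi))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        m_bin = Binarizer.apply(self.mask, self.tau)        # Eq. 1
        return F.linear(x, self.weight * m_bin, self.bias)  # Hadamard product
```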

Configuration of masking
Our masking scheme is motivated by the following observation: the pretrained weights form a good initialization (Hao et al., 2019), yet some adaptation is still needed to produce competitive performance on a specific task. However, not every pretrained parameter is necessary for achieving reasonable performance, as suggested by the field of neural network pruning (LeCun et al., 1990; Hassibi and Stork, 1993; Han et al., 2015b). We now investigate two configuration choices that affect how many parameters are "eligible" for masking.
Initial sparsity of $M^l_{\text{bin}}$. As we randomly initialize our masks from uniform distributions, the sparsity of the binary mask $M^l_{\text{bin}}$ at mask initialization controls how many pretrained parameters in a layer $W^l$ are assumed to be irrelevant to the downstream task. Different initial sparsity rates entail different optimization behaviors.
It is crucial to better understand how the initial sparsity of a mask impacts the training dynamics and final model performance, so as to generalize our masking scheme to broader domains and tasks. In §5.1 we investigate this aspect in detail. In practice, we fix τ while adjusting the uniform distribution to achieve a target initial sparsity.
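As a worked example of this adjustment (a sketch under the assumption that masks are drawn from a uniform distribution of fixed width): for $M^l \sim U(lo, hi)$, the initial sparsity is $(\tau - lo)/(hi - lo)$, so the bounds can be solved for directly.

```python
def uniform_bounds(tau: float, sparsity: float, width: float = 0.02):
    """Bounds (lo, hi) of U(lo, hi) such that a fraction `sparsity` of mask
    entries starts below the threshold tau, i.e., is initially pruned:
    P(m < tau) = (tau - lo) / (hi - lo) = sparsity."""
    return tau - sparsity * width, tau + (1.0 - sparsity) * width
```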
Which layers to mask. Different layers of pretrained language models capture distinct aspects of a language during pretraining. For example, Tenney et al. (2019) find that information on POS tagging, parsing, NER, semantic roles, and coreference is encoded on progressively higher layers of BERT; Jawahar et al. (2019) show that higher transformer layers of BERT encode long-term dependencies better than lower layers. It is hard to know a priori which types of NLP tasks will have to be addressed in the future, making it non-trivial to decide which layers to mask. We study this factor in §5.2.
We do not learn a mask for the lowest embedding layer, i.e., the uncontextualized wordpiece embeddings are completely "selected", for all tasks. The motivation is two-fold. (i) The embedding layer weights take up a large part, e.g., almost 21% (23M/109M) in BERT-base-uncased, of the total number of parameters. Not having to learn a selective mask for this layer reduces memory consumption. (ii) We assume that pretraining has effectively encoded context-independent meanings of words in wordpiece and position embeddings. Hence, learning a selective mask for the embedding layer is unnecessary. In addition, we do not learn masks for biases and layer normalization parameters as we did not observe a positive effect on performance.
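The configuration choices above can be expressed as a small helper that wraps only the chosen blocks, leaving embeddings, biases, and LayerNorm parameters untouched. This sketch assumes HuggingFace-style module names (`attention.self.query`, etc.), which may differ in other codebases.

```python
import torch.nn as nn

def apply_masking(bert_encoder: nn.Module, layers_to_mask, tau: float = 0.01,
                  init_sparsity: float = 0.05) -> None:
    """Wrap the six linear layers of each selected transformer block in
    MaskedLinear (defined above); all other parameters stay unmasked."""
    for l in layers_to_mask:
        block = bert_encoder.layer[l]
        targets = [  # W_K, W_Q, W_V, W_AO, W_I, W_O of the l-th block
            (block.attention.self, "key"),
            (block.attention.self, "query"),
            (block.attention.self, "value"),
            (block.attention.output, "dense"),
            (block.intermediate, "dense"),
            (block.output, "dense"),
        ]
        for parent, attr in targets:
            setattr(parent, attr, MaskedLinear(getattr(parent, attr), tau, init_sparsity))
```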

Datasets
We present results for masking BERT and RoBERTa on sequence classification, part-of-speech tagging, and reading comprehension.
We experiment with part-of-speech tagging (POS) on the Penn Treebank (Marcus et al., 1993), using Collins (2002)'s train/dev/test split. We experiment with reading comprehension on SWAG (Zellers et al., 2018) using the official data splits.
We report Matthew's correlation coefficient (MCC) for CoLA and accuracy for the other tasks.

Setup
Due to resource limitations and in the spirit of environmental responsibility (Strubell et al., 2019; Schwartz et al., 2019), we conduct our main experiments on the base models: BERT-base-uncased and RoBERTa-base. We implement our models in PyTorch (Paszke et al., 2019). Throughout all experiments, we limit the maximum length of a sentence (pair) to 128 after wordpiece tokenization. Following Devlin et al.
(2019), we use the Adam optimizer (Kingma and Ba, 2014), treating the learning rate as a hyperparameter and keeping the other optimizer parameters at their defaults. We carefully tune the learning rate for each setup: the tuning procedure ensures that the best learning rate does not lie on the border of our search grid; otherwise, we extend the grid accordingly. The initial grid is {1e-5, 3e-5, 5e-5, 7e-5, 9e-5}.
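A sketch of this border-aware grid extension; `evaluate` is a hypothetical callable mapping a learning rate to dev-set performance, and the extension step size is an assumption.

```python
def tune_learning_rate(evaluate, grid=(1e-5, 3e-5, 5e-5, 7e-5, 9e-5), step=2e-5):
    """Extend the search grid whenever the best learning rate lies on its border."""
    scores = {lr: evaluate(lr) for lr in grid}
    while True:
        best = max(scores, key=scores.get)
        if best == min(scores):          # best sits on the lower border: extend down
            lr = max(min(scores) - step, 1e-6)
        elif best == max(scores):        # best sits on the upper border: extend up
            lr = max(scores) + step
        else:
            return best                  # best is interior: done
        scores[lr] = evaluate(lr)
```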
For sequence classification and reading comprehension, we use [CLS] as the representation of the sentence (pair). For POS experiments, the representation of a tokenized word is its last wordpiece (Liu et al., 2019a; He and Choi, 2019).

Initial sparsity of binary masks
We first investigate how the initial sparsity (i.e., fraction of zeros) of the binary mask $M^l_{\text{bin}}$ influences the performance of a masked language model on downstream tasks. We experiment on four tasks, with initial sparsities in {1%, 3%, 5%, 10%, 15%, 20%, ..., 95%}. All other hyperparameters are controlled: the learning rate is fixed to 5e-5; the batch size is 32 for relatively small datasets (RTE, MRPC, and CoLA) and 128 for SST2. Each experiment is repeated four times with different random seeds. In this experiment, all transformer blocks, the pooler layer, and the classifier layer are masked.

Figure 1 shows that masking achieves decent performance without hyperparameter search. Specifically: (i) A large initial sparsity removing most pretrained parameters, e.g., 95%, leads to bad performance on the four tasks, because the pretrained knowledge is largely discarded when most of the connections are removed. (ii) Gradually decreasing the initial sparsity improves task performance. Generally, an initial sparsity of 3% to 10% yields reasonable results across tasks. Large datasets like SST2 (67k) are less sensitive than small datasets like RTE (2.5k). (iii) Selecting almost all pretrained parameters, e.g., 1% sparsity, also hurts task performance. Recall that a pretrained model needs to be adapted to a downstream task; masking achieves adaptation by learning selective masks, and preserving too many pretrained parameters at initialization impedes the optimization. Also, the classifier layer is randomly initialized and then masked; selecting too many randomly initialized parameters from this layer similarly makes optimization difficult.

Layer-wise behaviors
Neural network layers exhibit heterogeneous characteristics (Zhang et al., 2019) when applied to tasks. For example, in ELMo (Peters et al., 2018), syntactic information is better represented at lower layers while semantic information is captured at higher layers. As a result, simply masking all transformer blocks (as in §5.1) may not be ideal. It is possible that removing too many pretrained parameters in lower layers hurts the ability of upper layers to create high-quality representations.
We investigate the layer-wise behavior of BERT when using the masking scheme. Specifically, in Figure 2, we observe the sparsity change, i.e., final sparsity percentage minus initial sparsity percentage, of each masked layer $\hat{W}^l \in \{\hat{W}^l_K, \hat{W}^l_Q, \hat{W}^l_V, \hat{W}^l_{AO}, \hat{W}^l_I, \hat{W}^l_O\}$ of the $l$-th transformer block, the masked pooler layer $\hat{W}_P$, and the masked classifier layer $\hat{W}_T$.
We run our masking scheme on BERT for 10 epochs on POS and SST2. The initial sparsity of $M^l_{\text{bin}}$ for all masked layers is purposely set to a high value of 50% to encourage strong effects. Such a high initial sparsity leads to task performance that is inferior to finetuning but better than the baseline: 0.866 accuracy for SST2 and 0.970 for POS on dev.

Figure 2 presents the layer-wise sparsity change of the $l$-th transformer blocks on POS and SST2. The sparsity changes of $\hat{W}_P$ and $\hat{W}_T$ are 0.00% and -9.86% for POS, and -0.04% and -1.69% for SST2. Observations: (i) Most sparsity changes are negative, meaning that the learning objective encourages $M^l_{\text{bin}}$ to be less sparse than the large initial sparsity of 50%. For POS, the top-layer sparsities increase, reflecting that the encoded abstract semantic information is not helpful for solving this syntactic task. (ii) Sparsity decreases in the lower layers of masked BERT are about 3 times larger for POS than for SST2 (sentiment classification). This is consistent with previous studies showing that syntactic information is better encoded in lower layers (Peters et al., 2018). $\hat{W}_T$ sees the largest sparsity change: -9.86% for POS and -1.69% for SST2. We conjecture that the randomly initialized weights in this layer require more adaptation than pretrained layers to fit a downstream task.

Figure 3 presents the optimal task performance when masking only a subset of BERT's transformer blocks on MRPC, CoLA, and RTE. We see that (i) in most cases, top-down masking outperforms bottom-up masking when the initial sparsity and the number of masked layers are fixed. Thus, it is reasonable to select all pretrained weights in lower layers, since they capture general information helpful and transferable to various tasks (Liu et al., 2019a; Howard and Ruder, 2018). (ii) Figure 3 also illustrates dependencies between BERT layers and the learning dynamics of masking: provided with selected pretrained weights in lower layers, higher layers need to be given the flexibility to select pretrained weights accordingly to achieve good task performance. (iii) In top-down masking, CoLA performance increases when masking a growing number of layers, while MRPC and RTE are not sensitive. Recall that CoLA tests linguistic acceptability, which typically requires both syntactic and semantic information: for example, distinguishing acceptable caused-motion constructions (e.g., "the professor talked us into a stupor") from unacceptable ones (e.g., "the hall talked us into a series") requires both (Goldberg, 1995). All BERT layers are involved in representing this information, hence allowing more layers to change should improve performance.

Comparing finetuning and masking
We have thoroughly investigated two factors, initial sparsity (§5.1) and layer-wise behavior (§5.2), that are important when masking pretrained language models. Here, we compare the performance of masking and finetuning on a series of NLP tasks.
We search for the optimal learning rate per task as described in §4.2. We use a batch size of 32 for tasks that have <96k training examples; for AG, QNLI, and SWAG, we use batch size 128. The optimal hyperparameters per task are shown in §A.1.

Table 1 compares the performance of masking and finetuning on the dev set for GLUE tasks, POS, and SWAG. We observe that applying masking to BERT and RoBERTa yields performance comparable to finetuning. We observe a performance drop on RoBERTa-RTE. RTE has the smallest dataset size (2.5k in train and 0.3k in dev) among all tasks; this may contribute to the imperfect results and large performance variances.
Rows "Single" in Table 2 compare performance of masking and finetuning BERT on the test set of SEM, TREC-6 and AG. The same setup and hyperparameter searching of masking BERT as Table 1 are used, the best hyperparameters are picked on the dev set. Results from Sun et al. (2019) are included as a reference. Note that Sun et al. (2019) use configurations like layer-wise learning rate, producing slightly better performance than ours. Palogiannidi et al. (2016) is the best performing systems on task SEM (Nakov et al., 2016). Again, masking yields results comparable to finetuning.
Next, we compare ensembled results to better demonstrate the memory efficiency of the masking scheme. Three ensemble methods are considered: (i) majority voting, where the most voted label is the final prediction; (ii) ensemble of logits, where the label with the highest summed logit is the final prediction; (iii) ensemble of probabilities, where the label with the highest overall predicted probability is the final prediction. The best ensemble method is picked on dev and then evaluated on test. Rows "Ensem." in Table 2 compare ensembled results and model size. Masking consumes only 474MB of memory, much less than the 1752MB required by finetuning, and achieves comparable performance. Thus, masking is much more memory-efficient than finetuning in an ensemble setting.
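The three ensemble methods are straightforward to implement given per-model logits; a minimal sketch (tensor shapes assumed as noted in the comments):

```python
import torch

def ensemble_predict(logits_list, method: str = "logits") -> torch.Tensor:
    """Combine predictions of ensemble members; each element of `logits_list`
    is a [batch, num_labels] tensor produced by one model."""
    if method == "voting":    # (i) majority voting over hard labels
        votes = torch.stack([l.argmax(dim=-1) for l in logits_list])
        return votes.mode(dim=0).values
    if method == "logits":    # (ii) highest summed logit wins
        return torch.stack(logits_list).sum(dim=0).argmax(dim=-1)
    if method == "probs":     # (iii) highest average predicted probability wins
        probs = torch.stack([l.softmax(dim=-1) for l in logits_list])
        return probs.mean(dim=0).argmax(dim=-1)
    raise ValueError(f"unknown ensemble method: {method}")
```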

Intrinsic evaluations
Through extensive evaluations on a series of NLP tasks, §5 demonstrates that the masking scheme is an efficient alternative to finetuning. We now analyze properties of the representations computed by masked BERT/RoBERTa with intrinsic evaluations.

One intriguing property of finetuning, i.e., stacking a classifier layer on top of a pretrained language model and then updating all parameters, is that a linear classifier layer suffices to conduct reasonably accurate classification. This observation implies that the configuration of data points, e.g., sentences with positive or negative sentiment in SST2, should be close to linearly separable in the hidden space. Like finetuning, masking also uses a linear classifier layer. Hence, we hypothesize that upper layers in masked BERT/RoBERTa, even without explicit weight updating, also create a hidden space in which data points are close to linearly separable. Figure 4 shows t-SNE visualizations of the computed representations. The representations computed by the pretrained models are clearly not distinguishable by class, since the pretrained models need adaptation to downstream tasks. After applying the masking scheme, the computed representations are almost linearly separable and consistent with the gold labels. Thus, a linear classifier is expected to yield reasonably good classification accuracy. All t-SNE visualizations of pretrained, finetuned, and masked BERT/RoBERTa are presented in §A.3. This intrinsic evaluation illustrates that masked BERT/RoBERTa extracts good representations from the data for the downstream NLP task.
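Visualizations like Figure 4 can be produced with an off-the-shelf t-SNE implementation; a sketch assuming the [CLS] representations have already been collected into an array:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_cls_tsne(cls_vectors, gold_labels, title: str) -> None:
    """Project [CLS] representations (shape [n, hidden]) to 2-D and color by gold label."""
    xy = TSNE(n_components=2, random_state=0).fit_transform(cls_vectors)
    plt.scatter(xy[:, 0], xy[:, 1], c=gold_labels, s=4, cmap="coolwarm")
    plt.title(title)
    plt.show()
```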
Do the masked models generalize?

Figure 4 illustrates that a masked language model extracts proper text representations for the classifier layer and hence performs as well as finetuning.
Here, we are interested in verifying that our masked language models indeed solve downstream tasks by learning meaningful representations, instead of simply exploiting superficial noise in a dataset. To this end, we test whether a masked language model generalizes to other datasets of the same type of downstream task. We use the two sentiment classification datasets in our task pool: SST2 and SEM. We evaluate the model masked or finetuned on SST2 against the dev set of SEM and vice versa. Table 3 reports the results. For example, in (a), the cell value -13.4 means that the SST2-masked BERT performs 13.4% worse than the majority baseline on the dev set of SEM.
Comparing SST2 and SEM, we can observe that knowledge learned on SST2 does not generalize to SEM, for both finetuning and masking. Note that the Twitter domain (SEM) is much more specific than movie reviews (SST2). For example, some Emojis or symbols like ":)" reflecting strong sentiment do not occur in SST2, resulting in unsuccessful generalization. On the other hand, the finetuned and masked models of SEM generalize well on SST2, showing ≈ 20% improvement against the majority baseline. Thus, the masked models indeed create representations that contain valid information for downstream tasks.

Loss landscape
Training complex neural networks can be viewed as searching for good minima in the highly non-convex landscape defined by the loss function (Li et al., 2018). Good minima are typically depicted as points at the bottom of different locally convex valleys (Keskar et al., 2016; Draxler et al., 2018), achieving similar performance. In this section, we study, for BERT and RoBERTa, the relationship between the two minima obtained by masking and finetuning.
Recent work analyzing the loss landscape suggests that the local minima reached by vanilla training can be connected by a simple path (Garipov et al., 2018; Gotmare et al., 2018), e.g., a Bézier curve, with low task loss (or high task accuracy) along the path. We are interested in testing whether the two minima found by finetuning and masking can be easily connected in the loss landscape. To start with, we verify the task performance of an interpolated model $W(\gamma) = (1 - \gamma) W_0 + \gamma W_1$, $\gamma \in [0, 1]$, on the line segment between a finetuned model $W_0$ and a masked model $W_1$.

We conduct experiments on MRPC and SST2 with the best-performing BERT and RoBERTa models obtained in Table 1 (same seed and training epochs); Figure 5 shows the results of mode connectivity, i.e., the evolution of the loss value along a line connecting two candidate minima. Surprisingly, the interpolated models on the line segment connecting a finetuned and a masked model form a high-accuracy path, indicating an extremely well-connected loss landscape. Thus, the masking scheme finds minima on the same connected low-loss manifold as finetuning, confirming the effectiveness of our method. We also experimented with Bézier curves as proposed by Garipov et al. (2018); similar observations can be made, as shown in §A.4.
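A sketch of the linear interpolation, assuming both endpoints are materialized as state dicts with identical keys (for the masked model, the effective weights $W \odot M_{\text{bin}}$):

```python
import copy
import torch

def interpolate(model_a, model_b, gamma: float):
    """Return a model with weights W(gamma) = (1 - gamma) * W_a + gamma * W_b."""
    model = copy.deepcopy(model_a)
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    mixed = {k: ((1 - gamma) * sd_a[k] + gamma * sd_b[k])
                if sd_a[k].is_floating_point() else sd_a[k]  # skip integer buffers
             for k in sd_a}
    model.load_state_dict(mixed)
    return model

# Dev accuracy along the segment, e.g.:
# for gamma in torch.linspace(0, 1, 11): evaluate(interpolate(w0, w1, gamma.item()))
```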

Conclusion
We have presented masking, a new way of utilizing pretrained language models that is more efficient than finetuning. Instead of directly modifying the pretrained parameters through additional task-specific training (as in finetuning), we train only one binary mask per task to select parameters critical to the task. Extensive experiments show that masking yields performance comparable to traditional finetuning on a series of NLP tasks. Leaving the pretrained language model parameters unchanged, masking is much more parameter- and memory-efficient when several tasks need to be solved simultaneously. Intrinsic evaluations show that masked language models extract valid representations for downstream tasks. We further show that these representations are generalizable. Moreover, we demonstrate that the minima obtained by finetuning and masking can be easily connected by a line segment, further confirming the effectiveness of applying masking to pretrained language models.

A.1 Hyperparameters

We perform finetuning/masking on all tasks for 10 epochs with early stopping of 2 epochs.

A.3 T-SNE visualizations
A.4 Mode connectivity with Bézier curves

We train a Bézier curve with three bends by minimizing the loss $\mathbb{E}_{t \sim U[0,1]} \mathcal{L}(\phi_\theta(t))$, where $U[0,1]$ is the uniform distribution on the interval $[0, 1]$. A Monte Carlo method is used to estimate the gradient of this expectation, and gradient-based optimization is used for the minimization. The results are illustrated in Figure 9. Masking implicitly performs gradient descent, analogous to the weight updates achieved by finetuning; these observations complement our arguments in the main text.
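A simplified sketch of the curve-finding procedure of Garipov et al. (2018) on flattened parameter vectors; `task_loss` is a hypothetical callable evaluating the task loss at a flat parameter vector.

```python
import torch
from math import comb

def bezier(t: torch.Tensor, points: list) -> torch.Tensor:
    """Order-n Bézier curve phi_theta(t) over flattened parameter vectors."""
    n = len(points) - 1
    return sum(comb(n, i) * t**i * (1 - t)**(n - i) * p  # Bernstein basis
               for i, p in enumerate(points))

def train_curve(w0, w1, task_loss, steps: int = 1000, lr: float = 1e-2):
    """Fix the endpoints (finetuned w0, masked w1); train three interior bends
    by Monte Carlo estimation of E_{t ~ U[0,1]} L(phi_theta(t))."""
    bends = [((1 - a) * w0 + a * w1).clone().requires_grad_(True)
             for a in (0.25, 0.5, 0.75)]
    opt = torch.optim.SGD(bends, lr=lr)
    for _ in range(steps):
        t = torch.rand(())                       # sample t ~ U[0, 1]
        loss = task_loss(bezier(t, [w0, *bends, w1]))
        opt.zero_grad(); loss.backward(); opt.step()
    return bends
```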
In addition, Figure 10 visualizes the line segment between a pretrained language model and a finetuned or masked model (on a downstream task), highlighting that the observed mode connectivity is not solely due to the overparameterization of the pretrained language model.

Figure 10: The accuracy on the dev set as a function of the point on the curves $\phi_\theta(\gamma)$ connecting a pretrained language model (left, $\gamma = 0$) and a finetuned or masked model (right, $\gamma = 1$) on a downstream task.