Masking as an Efficient Alternative to Finetuning for Pretrained Language Models

Mengjie Zhao, Tao Lin, Fei Mi, Martin Jaggi, Hinrich Schütze


Abstract
We present an efficient method of utilizing pretrained language models, where we learn selective binary masks for pretrained weights in lieu of modifying them through finetuning. Extensive evaluations of masking BERT, RoBERTa, and DistilBERT on eleven diverse NLP tasks show that our masking scheme yields performance comparable to finetuning, yet has a much smaller memory footprint when several tasks need to be inferred. Intrinsic evaluations show that representations computed by our binary masked language models encode information necessary for solving downstream tasks. Analyzing the loss landscape, we show that masking and finetuning produce models that reside in minima that can be connected by a line segment with nearly constant test accuracy. This confirms that masking can be utilized as an efficient alternative to finetuning.
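The abstract describes learning binary masks over frozen pretrained weights instead of updating the weights through finetuning. As a rough, hypothetical sketch of that idea (not the authors' actual implementation), the PyTorch snippet below wraps a single pretrained linear layer with per-weight real-valued scores that are thresholded into a {0, 1} mask via a straight-through estimator. The MaskedLinear class name, the 0.5 threshold, and the score initialization are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Hypothetical sketch: a linear layer whose pretrained weight stays frozen
    while a binary mask over that weight is learned for a downstream task."""

    def __init__(self, pretrained_linear: nn.Linear, threshold: float = 0.5):
        super().__init__()
        # Freeze the pretrained weight (and bias, if present).
        self.weight = nn.Parameter(pretrained_linear.weight.detach().clone(),
                                   requires_grad=False)
        self.bias = (nn.Parameter(pretrained_linear.bias.detach().clone(),
                                  requires_grad=False)
                     if pretrained_linear.bias is not None else None)
        # One trainable real-valued score per weight entry; initialized just
        # above the threshold so the mask starts out (almost) all-ones.
        self.scores = nn.Parameter(torch.full_like(self.weight, threshold + 0.01))
        self.threshold = threshold

    def forward(self, x):
        hard_mask = (self.scores > self.threshold).float()
        # Straight-through estimator: the forward pass uses the hard {0, 1}
        # mask, the backward pass routes gradients to the scores directly.
        mask = hard_mask + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask, self.bias)


# Usage sketch: mask one stand-in projection matrix and check that only the
# scores (not the pretrained weights) receive gradients.
pretrained = nn.Linear(768, 768)
masked = MaskedLinear(pretrained)
out = masked(torch.randn(2, 16, 768))   # (batch, seq_len, hidden)
out.sum().backward()
print(masked.scores.grad is not None)   # True: the mask scores are trained
print(masked.weight.grad)               # None: the pretrained weight is untouched
```

Because only a binary mask per task needs to be stored on top of the shared pretrained weights, serving several tasks requires far less memory than keeping a fully finetuned copy of the model for each task, which is the efficiency argument the abstract makes.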
Anthology ID:
2020.emnlp-main.174
Volume:
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Month:
November
Year:
2020
Address:
Online
Editors:
Bonnie Webber, Trevor Cohn, Yulan He, Yang Liu
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
2226–2241
URL:
https://aclanthology.org/2020.emnlp-main.174
DOI:
10.18653/v1/2020.emnlp-main.174
Cite (ACL):
Mengjie Zhao, Tao Lin, Fei Mi, Martin Jaggi, and Hinrich Schütze. 2020. Masking as an Efficient Alternative to Finetuning for Pretrained Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2226–2241, Online. Association for Computational Linguistics.
Cite (Informal):
Masking as an Efficient Alternative to Finetuning for Pretrained Language Models (Zhao et al., EMNLP 2020)
PDF:
https://aclanthology.org/2020.emnlp-main.174.pdf
Video:
https://slideslive.com/38938867
Data:
CoLA, GLUE, MRPC, QNLI, SST, SST-2, SWAG