Structured Pruning of Large Language Models

Ziheng Wang, Jeremy Wohlwend, Tao Lei


Abstract
Large language models have recently achieved state-of-the-art performance across a wide variety of natural language tasks. Meanwhile, the size of these models and their latency have increased significantly, which makes their usage costly and raises an interesting question: do language models need to be large? We study this question through the lens of model compression. We present a generic, structured pruning approach that parameterizes each weight matrix using its low-rank factorization and adaptively removes rank-1 components during training. On language modeling tasks, our structured approach outperforms unstructured and block-structured pruning baselines at various compression levels, while achieving significant speedups during both training and inference. We also demonstrate that our method can be applied to pruning adaptive word embeddings in large language models, and to pruning the BERT model on several downstream fine-tuning classification benchmarks.
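The core idea of the abstract, factorizing each weight matrix into rank-1 components and gating them so that entire components can be removed, can be illustrated with a minimal sketch. This is not the paper's FLOP implementation: it assumes PyTorch, uses a simple L1 penalty with magnitude thresholding in place of the paper's L0 / hard-concrete regularization, and the class and method names (FactorizedLinear, prune) are hypothetical.

```python
# Minimal sketch of structured pruning via low-rank factorization.
# Assumptions: PyTorch; a plain L1 gate penalty and magnitude thresholding
# stand in for the paper's L0 / hard-concrete relaxation.
import torch
import torch.nn as nn


class FactorizedLinear(nn.Module):
    """Parameterize W (out x in) as P @ diag(g) @ Q; g gates rank-1 components."""

    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        self.P = nn.Parameter(torch.randn(out_features, rank) * 0.02)
        self.Q = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.g = nn.Parameter(torch.ones(rank))  # one gate per rank-1 component
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., in_features) -> (..., out_features)
        h = x @ self.Q.t()      # project input into the rank space
        h = h * self.g          # scale each rank-1 component by its gate
        return h @ self.P.t() + self.bias

    @torch.no_grad()
    def prune(self, threshold: float = 1e-2) -> None:
        """Drop rank-1 components whose gate magnitude fell below the threshold."""
        keep = self.g.abs() > threshold
        self.P = nn.Parameter(self.P[:, keep].clone())
        self.Q = nn.Parameter(self.Q[keep, :].clone())
        self.g = nn.Parameter(self.g[keep].clone())


if __name__ == "__main__":
    layer = FactorizedLinear(in_features=512, out_features=512, rank=64)
    x = torch.randn(8, 512)
    y = layer(x)
    # A sparsity-inducing penalty on the gates (L1 here) would push some gates
    # toward zero during training; those components are then removed outright.
    loss = y.pow(2).mean() + 1e-3 * layer.g.abs().sum()
    loss.backward()
    layer.prune(threshold=1e-2)
    print(y.shape, layer.g.numel())
```

Because whole rank-1 components are removed, the surviving factors remain dense but smaller, which is what allows the structured approach to deliver real training and inference speedups, unlike the irregular sparsity produced by unstructured pruning.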
Anthology ID:
2020.emnlp-main.496
Volume:
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Month:
November
Year:
2020
Address:
Online
Editors:
Bonnie Webber, Trevor Cohn, Yulan He, Yang Liu
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
6151–6162
URL:
https://aclanthology.org/2020.emnlp-main.496
DOI:
10.18653/v1/2020.emnlp-main.496
Cite (ACL):
Ziheng Wang, Jeremy Wohlwend, and Tao Lei. 2020. Structured Pruning of Large Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6151–6162, Online. Association for Computational Linguistics.
Cite (Informal):
Structured Pruning of Large Language Models (Wang et al., EMNLP 2020)
PDF:
https://aclanthology.org/2020.emnlp-main.496.pdf
Video:
https://slideslive.com/38939265
Code:
asappresearch/flop + additional community code