BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

Authors: Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that BESA achieves state-of-the-art performance, efficiently pruning LLMs like LLaMA1 and LLaMA2 with 7B to 70B parameters on a single A100 GPU in just five hours. ... In this section, we present a comprehensive series of experiments designed to evaluate the effectiveness of our proposed methods. We begin by providing a detailed overview of our experiment settings, encompassing the configuration of our experiments, the specific Large Language Model (LLM) under evaluation, the benchmark dataset utilized, and the baseline method employed for comparison.
Researcher Affiliation | Collaboration | Peng Xu (1,2), Wenqi Shao* (2), Mengzhao Chen (2), Shitao Tang, Kaipeng Zhang (2), Peng Gao (2), Fengwei An (3), Yu Qiao (2), Ping Luo* (1,2); (1) The University of Hong Kong, (2) OpenGVLab, Shanghai AI Laboratory, (3) Southern University of Science and Technology
Pseudocode | Yes | Algorithm 1: Overall algorithm of BESA.
Open Source Code | No | Code is available at here.
Open Datasets | Yes | We primarily measure model perplexity using the WikiText2 (Merity, 2016), C4 (Raffel et al., 2020), and PTB (Marcus et al., 1994) datasets. (See the perplexity-evaluation sketch after this table.)
Dataset Splits | Yes | The calibration set used consisted of 128 sequences, each comprising 2048 tokens, sampled from the first shard of the C4 training dataset, mirroring the approach adopted in the baseline methods. (See the calibration-sampling sketch after this table.)
Hardware Specification | Yes | All pruning experiments were executed on a single NVIDIA A100 GPU equipped with 80GB of memory.
Software Dependencies | No | Our proposed method, along with the baseline methods, was implemented using the PyTorch framework.
Experiment Setup | Yes | All pruning experiments were executed on a single NVIDIA A100 GPU equipped with 80GB of memory. Our proposed method, along with the baseline methods, was implemented using the PyTorch framework. The calibration set used consisted of 128 sequences, each comprising 2048 tokens, sampled from the first shard of the C4 training dataset, mirroring the approach adopted in the baseline methods. LLM models and datasets were sourced from the Huggingface Transformers library (Wolf et al., 2020). Zero-shot experiments were conducted with the assistance of the Language Model Evaluation Harness (LM-Eval) library (Gao et al., 2021). In this configuration, our proposed method achieved full sparsity in the LLaMA-65B model within a remarkable time frame of 4.5 hours. ... We pruned all linear layers, excluding embeddings and the model head, achieving a 50% unstructured sparsity level. ... We adopt 1 epoch of training as our default setting ... a sparsity step of 0.01 implies sparsity candidates ranging from 1.0 to 0.0 with a step size of 0.01. (See the sparsity-mask sketch after this table.)
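The calibration setup described above (128 sequences of 2048 tokens drawn from the first shard of the C4 training data) can be reproduced roughly as follows. This is a minimal sketch, assuming the Hugging Face datasets/transformers libraries; the shard file name, the tokenizer settings, and the LLaMA checkpoint name are assumptions, not details taken from the authors' released code.

```python
# Hypothetical sketch: sample a 128 x 2048-token calibration set from the first
# shard of the C4 training split. Dataset file name, model name, and sampling
# details are assumptions, not the paper's released implementation.
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def build_calibration_set(model_name="huggyllama/llama-7b",
                          n_samples=128, seq_len=2048, seed=0):
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    # First shard of the C4 training data (file name is an assumption).
    data = load_dataset(
        "allenai/c4",
        data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
        split="train",
    )
    random.seed(seed)
    samples = []
    while len(samples) < n_samples:
        doc = data[random.randint(0, len(data) - 1)]["text"]
        ids = tokenizer(doc, return_tensors="pt").input_ids
        if ids.shape[1] <= seq_len:            # skip documents that are too short
            continue
        start = random.randint(0, ids.shape[1] - seq_len - 1)
        samples.append(ids[:, start:start + seq_len])
    return torch.cat(samples, dim=0)           # (n_samples, seq_len) token ids
```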
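For the pruning configuration (50% unstructured sparsity over all linear layers, with embeddings and the model head kept dense, and sparsity candidates from 0.0 to 1.0 in steps of 0.01), the sketch below shows how a chosen per-layer ratio can be turned into an unstructured mask. Plain weight magnitude is used here as a stand-in importance metric; BESA itself learns the per-layer ratios blockwise, which this sketch does not reproduce.

```python
# Hypothetical sketch: apply a chosen unstructured sparsity ratio to every
# nn.Linear weight, skipping embeddings and the model head. Magnitude pruning
# is a stand-in for the paper's importance metric; BESA learns the per-layer
# ratios from the candidate grid (0.00, 0.01, ..., 1.00 for a 0.01 step).
import torch
import torch.nn as nn

SPARSITY_STEP = 0.01
SPARSITY_CANDIDATES = [round(i * SPARSITY_STEP, 2)
                       for i in range(int(1 / SPARSITY_STEP) + 1)]

@torch.no_grad()
def apply_unstructured_sparsity(model, sparsity=0.5):
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        if "embed" in name or "lm_head" in name:   # keep embeddings / model head dense
            continue
        w = module.weight.data
        k = int(sparsity * w.numel())
        if k == 0:
            continue
        # Zero out the k smallest-magnitude weights (ties at the threshold drop too).
        threshold = w.abs().flatten().kthvalue(k).values
        w.mul_((w.abs() > threshold).to(w.dtype))
```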
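Perplexity on WikiText2 (and analogously on C4 and PTB) is commonly measured over non-overlapping 2048-token windows of the concatenated test split. The sketch below follows that convention; the dataset identifier, model name, and windowing choice are assumptions rather than details quoted from the paper.

```python
# Hypothetical sketch: sliding-window perplexity on the WikiText2 test split
# with a 2048-token context. Model/dataset names and the non-overlapping
# window scheme are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def wikitext2_perplexity(model_name="huggyllama/llama-7b",
                         seq_len=2048, device="cuda"):
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16).to(device)
    model.eval()
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids
    nlls = []
    n_windows = ids.shape[1] // seq_len
    for i in range(n_windows):
        batch = ids[:, i * seq_len:(i + 1) * seq_len].to(device)
        loss = model(batch, labels=batch).loss    # mean token NLL over the window
        nlls.append(loss.float() * seq_len)
    return torch.exp(torch.stack(nlls).sum() / (n_windows * seq_len)).item()
```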