SparseLLM: Towards Global Pruning of Pre-trained Language Models

Authors: Guangji Bai, Yijiang Li, Chen Ling, Kibaek Kim, Liang Zhao

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we find that SparseLLM can consistently improve the performance of local pruning methods, particularly in high sparsity regimes (> 60%), where the perplexity can be significantly decreased by up to around 80% as compared to the state-of-the-art methods. Our source code is publicly available at https://github.com/BaiTheBest/SparseLLM. We implemented SparseLLM in PyTorch [30] and use the Hugging Face Transformers library [31] for handling models and datasets. All pruning experiments are conducted on NVIDIA A100 GPUs.
Researcher Affiliation | Academia | Guangji Bai^1, Yijiang Li^2, Chen Ling^1, Kibaek Kim^2, Liang Zhao^1; ^1 Emory University, Atlanta, GA, USA; ^2 Argonne National Laboratory, Lemont, IL, USA
Pseudocode | Yes | Algorithm 1: SparseLLM Pruning of OPT Models. Input: an OPT decoder layer containing FFN and MHA modules; the FFN's up-scaling linear layer pre-trained weight matrix W_ℓ; the FFN's down-scaling linear layer pre-trained weight matrix W_{ℓ+1}; the input of the up-scaling linear layer a^pre_{ℓ-1}; the output of the down-scaling linear layer z^pre_{ℓ+1}; the target sparsity ρ; and constraint weight hyperparameters α, β. (An illustrative code sketch of this alternating structure follows the table.)
Open Source Code | Yes | Our source code is publicly available at https://github.com/BaiTheBest/SparseLLM.
Open Datasets | Yes | For calibration data, we follow [12] and use 128 2048-token segments, randomly chosen from the first shard of the C4 [32] dataset. This represents generic text data crawled from the internet and ensures our experiments are zero-shot as no task-specific data is seen during pruning. We consider the OPT model family [33] and LLaMA-2 model family [1] in our experiments as well as the most recent LLaMA-3 model. We consider the test sets of raw-WikiText2 [36] (WT2) and PTB [37] as well as a subset of the C4 validation data, all popular benchmarks in LLM compression literature [34, 38, 12, 13]. (An illustrative calibration-loading sketch follows the table.)
Dataset Splits | Yes | We consider the test sets of raw-WikiText2 [36] (WT2) and PTB [37] as well as a subset of the C4 validation data, all popular benchmarks in LLM compression literature [34, 38, 12, 13].
Hardware Specification | Yes | All pruning experiments are conducted on NVIDIA A100 GPUs.
Software Dependencies | No | We implemented SparseLLM in PyTorch [30] and use the Hugging Face Transformers library [31] for handling models and datasets. No specific version numbers for these software components were provided.
Experiment Setup | Yes | For each model, we consider unstructured sparsity ranging from 70% to 90% with a 10% increment, as well as a 3:4 semi-structured sparsity. We prune the first 50% of the Transformer decoder layers in each model to achieve a balance between the computation resources and the performances. We conduct a sensitivity study on the calibration sample sizes (see Appendix A.3) and use calibration sample sizes between 32 and 64 for all experiments. We select α and β from the set {0.01, 0.1, 1, 5, 10, 100} and perform a study on models to understand the impact of the hyperparameters. (A sketch enumerating such a sweep follows the table.)
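
To make the structure of Algorithm 1 (Pseudocode row) more concrete, the following is a minimal PyTorch sketch of an alternating scheme for pruning one FFN block, written only from the inputs listed in that row. It is not the authors' released implementation: magnitude pruning and plain least squares stand in for the paper's subproblem solvers, the ReLU coupling between the auxiliary variables is handled only approximately, and all names (magnitude_prune, prune_ffn_block, W_up, W_down, a_prev, z_out_dense) are illustrative.

import torch

def magnitude_prune(W: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Zero out the smallest-magnitude entries of W; a simple stand-in for the
    # paper's layer-wise pruning subproblem solver.
    k = int(W.numel() * sparsity)
    if k == 0:
        return W.clone()
    thresh = W.abs().flatten().kthvalue(k).values
    return W * (W.abs() > thresh)

def prune_ffn_block(W_up, W_down, a_prev, z_out_dense, rho, alpha, beta, iters=5):
    # Alternating sketch: prune the FFN's up-scaling matrix W_up (d_hidden x d_model)
    # and down-scaling matrix W_down (d_model x d_hidden) to sparsity rho while
    # softly matching the dense block output z_out_dense, weighted by alpha and beta.
    z = a_prev @ W_up.T          # auxiliary pre-activation, shape (n_tokens, d_hidden)
    a = torch.relu(z)            # auxiliary hidden activation
    for _ in range(iters):
        # (1) Re-fit each weight matrix to the current auxiliary targets by least
        #     squares, then prune it (the paper uses a dedicated solver here).
        W_up_s = magnitude_prune(torch.linalg.lstsq(a_prev, z).solution.T, rho)
        W_down_s = magnitude_prune(torch.linalg.lstsq(a, z_out_dense).solution.T, rho)
        # (2) Update the hidden activation a by ridge-style least squares so that
        #     a @ W_down_s.T stays close to z_out_dense (weight alpha) and a stays
        #     close to relu(z) (weight beta).
        A = alpha * W_down_s.T @ W_down_s + beta * torch.eye(
            W_down_s.shape[1], dtype=W_down_s.dtype, device=W_down_s.device)
        b = alpha * z_out_dense @ W_down_s + beta * torch.relu(z)
        a = torch.linalg.solve(A, b.T).T
        # (3) Refresh the pre-activation from the pruned up-scaling layer; the paper
        #     instead derives an analytic update that also respects a = relu(z).
        z = a_prev @ W_up_s.T
    return W_up_s, W_down_s

The point of the sketch is the alternating structure (prune the weights, then update the auxiliary activations so the block output stays close to its dense value), which is what distinguishes this kind of coordinated pruning from purely local, one-shot layer-by-layer pruning.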
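The calibration setup quoted in the Open Datasets row (128 random 2048-token segments from the first shard of C4, following [12]) can be reproduced roughly as below. This is a sketch, not the authors' data loader: the dataset id allenai/c4 and the shard file name are the ones commonly used in this line of work, and get_c4_calibration is an illustrative helper name.

import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def get_c4_calibration(tokenizer_name="facebook/opt-125m",
                       n_samples=128, seq_len=2048, seed=0):
    # Draw n_samples random seq_len-token segments from the first C4 training shard.
    random.seed(seed)
    tok = AutoTokenizer.from_pretrained(tokenizer_name, use_fast=True)
    shard = load_dataset(
        "allenai/c4",
        data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
        split="train",
    )
    samples = []
    while len(samples) < n_samples:
        text = shard[random.randint(0, len(shard) - 1)]["text"]
        ids = tok(text, return_tensors="pt").input_ids
        if ids.shape[1] <= seq_len:
            continue  # skip documents shorter than one full segment
        start = random.randint(0, ids.shape[1] - seq_len - 1)
        samples.append(ids[:, start:start + seq_len])
    return torch.cat(samples, dim=0)  # shape: (n_samples, seq_len)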
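As a quick illustration of the Experiment Setup row (sparsity levels, pruned layer fraction, calibration sizes, and the α/β grid), here is a hypothetical sweep enumeration. The field names and the enumerate_runs helper are assumptions for illustration, not part of the released code.

from itertools import product

SPARSITIES = [0.7, 0.8, 0.9, "3:4"]           # unstructured levels plus 3:4 semi-structured
LAYER_FRACTION = 0.5                          # prune the first 50% of decoder layers
CALIB_SIZES = [32, 64]                        # calibration sample sizes reported in the paper
ALPHA_BETA_GRID = [0.01, 0.1, 1, 5, 10, 100]  # candidate values for alpha and beta

def enumerate_runs(model_name):
    # Yield one configuration dict per (sparsity, calibration size, alpha, beta) combination.
    for sparsity, n_calib, alpha, beta in product(
            SPARSITIES, CALIB_SIZES, ALPHA_BETA_GRID, ALPHA_BETA_GRID):
        yield {
            "model": model_name,
            "sparsity": sparsity,
            "layer_fraction": LAYER_FRACTION,
            "n_calibration_samples": n_calib,
            "alpha": alpha,
            "beta": beta,
        }

# Example: 4 sparsity settings x 2 calibration sizes x 6 x 6 hyperparameter pairs = 288 runs.
print(sum(1 for _ in enumerate_runs("facebook/opt-1.3b")))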