SparseLLM: Towards Global Pruning of Pre-trained Language Models
Authors: Guangji Bai, Yijiang Li, Chen Ling, Kibaek Kim, Liang Zhao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we find that SparseLLM can consistently improve the performance of local pruning methods, particularly in high sparsity regimes (> 60%), where the perplexity can be significantly decreased by up to around 80% as compared to the state-of-the-art methods. Our source code is publicly available at https://github.com/BaiTheBest/SparseLLM. We implemented SparseLLM in PyTorch [30] and use the Hugging Face Transformers library [31] for handling models and datasets. All pruning experiments are conducted on NVIDIA A100 GPUs. |
| Researcher Affiliation | Academia | Guangji Bai¹, Yijiang Li², Chen Ling¹, Kibaek Kim², Liang Zhao¹. ¹Emory University, Atlanta, GA, USA; ²Argonne National Laboratory, Lemont, IL, USA |
| Pseudocode | Yes | Algorithm 1: SparseLLM Pruning of OPT Models. Input: an OPT decoder layer containing FFN and MHA modules; the FFN's up-scaling linear layer pre-trained weight matrix W_ℓ; the FFN's down-scaling linear layer pre-trained weight matrix W_{ℓ+1}; the input of the up-scaling linear layer a^pre_{ℓ−1}; the output of the down-scaling linear layer z^pre_{ℓ+1}; the target sparsity ρ; and the constraint weight hyperparameters α, β. (An interface sketch for these inputs follows the table.) |
| Open Source Code | Yes | Our source code is publicly available at https://github.com/BaiTheBest/SparseLLM. |
| Open Datasets | Yes | For calibration data, we follow [12] and use 128 2048-token segments, randomly chosen from the first shard of the C4 [32] dataset. This represents generic text data crawled from the internet and ensures our experiments are zero-shot as no task-specific data is seen during pruning. We consider the OPT model family [33] and LLaMA-2 model family [1] in our experiments as well as the most recent LLaMA-3 model. We consider the test sets of raw-WikiText2 [36] (WT2) and PTB [37] as well as a subset of the C4 validation data, all popular benchmarks in LLM compression literature [34, 38, 12, 13]. (A calibration-loading sketch follows the table.) |
| Dataset Splits | Yes | We consider the test sets of raw-WikiText2 [36] (WT2) and PTB [37] as well as a subset of the C4 validation data, all popular benchmarks in LLM compression literature [34, 38, 12, 13]. |
| Hardware Specification | Yes | All pruning experiments are conducted on NVIDIA A100 GPUs. |
| Software Dependencies | No | We implemented SparseLLM in PyTorch [30] and use the Hugging Face Transformers library [31] for handling models and datasets. No specific version numbers for these software components were provided. |
| Experiment Setup | Yes | For each model, we consider unstructured sparsity ranging from 70% to 90% in 10% increments, as well as 3:4 semi-structured sparsity. We prune the first 50% of the Transformer decoder layers in each model to balance computational cost and performance. We conduct a sensitivity study on the calibration sample sizes (see Appendix A.3) and use calibration sample sizes between 32 and 64 for all experiments. We select α and β from the set {0.01, 0.1, 1, 5, 10, 100} and perform a study on models to understand the impact of these hyperparameters. (These settings are collected into a configuration sketch after the table.) |
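
The Pseudocode row lists only the inputs of Algorithm 1, not its update rules. The sketch below is a minimal, non-authoritative illustration of how those inputs could map onto a Python interface for one OPT FFN block: the names `sparsellm_ffn_subproblem` and `prune_to_sparsity` are hypothetical, the magnitude-based pruning helper is a stand-in rather than the paper's solver, and the alternating updates that would use `a_pre`, `z_pre`, `alpha`, and `beta` are deliberately elided.

```python
import torch

def prune_to_sparsity(weight: torch.Tensor, rho: float) -> torch.Tensor:
    """Hypothetical helper: zero out the smallest-magnitude entries so that
    roughly a fraction `rho` of the weights is zero (simple magnitude
    criterion, not the paper's solver)."""
    k = int(rho * weight.numel())
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask

def sparsellm_ffn_subproblem(
    W_up: torch.Tensor,    # W_l: FFN up-scaling weight, shape (d_ffn, d_model)
    W_down: torch.Tensor,  # W_{l+1}: FFN down-scaling weight, shape (d_model, d_ffn)
    a_pre: torch.Tensor,   # a^pre_{l-1}: calibration input to the up-scaling layer
    z_pre: torch.Tensor,   # z^pre_{l+1}: calibration output of the down-scaling layer
    rho: float,            # target sparsity, e.g. 0.7
    alpha: float,          # constraint weight hyperparameter
    beta: float,           # constraint weight hyperparameter
):
    """Interface sketch only: Algorithm 1 alternates between pruning the two
    weight matrices and updating auxiliary variables so the block's output
    stays close to z^pre_{l+1}; that alternating loop is elided here."""
    W_up_sparse = prune_to_sparsity(W_up, rho)
    W_down_sparse = prune_to_sparsity(W_down, rho)
    return W_up_sparse, W_down_sparse
```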
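
The Open Datasets row states that calibration follows [12]: 128 randomly chosen 2048-token segments from the first shard of C4. A minimal sketch of that sampling step is given below, assuming the `allenai/c4` dataset on the Hugging Face Hub, the shard file name `en/c4-train.00000-of-01024.json.gz`, and `facebook/opt-1.3b` as a placeholder tokenizer; the authors' repository may implement this differently.

```python
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def get_c4_calibration(nsamples: int = 128, seqlen: int = 2048, seed: int = 0,
                       model_name: str = "facebook/opt-1.3b") -> torch.Tensor:
    """Sample `nsamples` random `seqlen`-token segments from the first C4 training shard."""
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    # First shard of the English C4 training split (assumed file name).
    data = load_dataset(
        "allenai/c4",
        data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
        split="train",
    )
    random.seed(seed)
    samples = []
    while len(samples) < nsamples:
        doc = data[random.randint(0, len(data) - 1)]
        ids = tokenizer(doc["text"], return_tensors="pt").input_ids
        if ids.shape[1] <= seqlen:
            continue  # skip documents shorter than one full segment
        start = random.randint(0, ids.shape[1] - seqlen - 1)
        samples.append(ids[:, start:start + seqlen])
    return torch.cat(samples, dim=0)  # shape (nsamples, seqlen)
```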
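
The Experiment Setup row fixes the sparsity levels, the fraction of decoder layers pruned, the calibration sizes, and the α/β grid. The configuration sketch below simply collects those reported values and shows one way to select the first 50% of an OPT model's decoder layers; treating the α/β search as a full Cartesian product over the listed set is an assumption, and no pruning routine is implemented here.

```python
from itertools import product
from transformers import OPTForCausalLM

# Settings quoted from the paper's experiment setup.
SPARSITY_LEVELS = [0.7, 0.8, 0.9]            # unstructured sparsity, 10% increments
SEMI_STRUCTURED = (3, 4)                     # 3:4 semi-structured sparsity
ALPHA_BETA_SET = [0.01, 0.1, 1, 5, 10, 100]  # values alpha and beta are selected from
CALIBRATION_SIZES = (32, 64)                 # calibration sample sizes used across experiments

def select_layers_to_prune(model: OPTForCausalLM):
    """Return the first 50% of the Transformer decoder layers of an OPT model."""
    layers = model.model.decoder.layers
    return list(layers[: len(layers) // 2])

def hyperparameter_grid():
    """Enumerate (alpha, beta) pairs; a full Cartesian product is an assumption."""
    return list(product(ALPHA_BETA_SET, ALPHA_BETA_SET))
```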