SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot
Authors: Elias Frantar, Dan Alistarh
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. Our experiments, from which we provide a snapshot in Figures 1 and 2, lead to the following observations. |
| Researcher Affiliation | Collaboration | ¹Institute of Science and Technology Austria (ISTA), ²Neural Magic Inc. |
| Pseudocode | Yes | Algorithm 1 The SparseGPT algorithm. We prune the layer matrix W to p% unstructured sparsity given inverse Hessian H^-1 = (XX^T + λI)^-1, lazy batch-update blocksize B and adaptive mask selection blocksize B_s; each B_s consecutive columns will be p% sparse. (A simplified sketch of this column-wise procedure appears after the table.) |
| Open Source Code | Yes | The code is available at: https://github.com/IST-DASLab/sparsegpt. |
| Open Datasets | Yes | For calibration data, we follow Frantar et al. (2022a) and use 128 2048-token segments, randomly chosen from the first shard of the C4 (Raffel et al., 2020) dataset. We consider the test sets of raw-WikiText2 (Merity et al., 2016) and PTB (Marcus et al., 1994) as well as a subset of the C4 validation data, all popular benchmarks in LLM compression literature. (Sketches of the calibration sampling and the perplexity protocol appear after the table.) |
| Dataset Splits | Yes | For calibration data, we follow Frantar et al. (2022a) and use 128 2048-token segments, randomly chosen from the first shard of the C4 (Raffel et al., 2020) dataset. We consider the test sets of raw-WikiText2 (Merity et al., 2016) and PTB (Marcus et al., 1994) as well as a subset of the C4 validation data, all popular benchmarks in LLM compression literature. |
| Hardware Specification | Yes | All pruning experiments are conducted on a single NVIDIA A100 GPU with 80GB of memory. |
| Software Dependencies | No | The paper mentions implementing SparseGPT in PyTorch and using the Hugging Face Transformers library, but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | All pruning experiments are conducted on a single NVIDIA A100 GPU with 80GB of memory. In this setup, SparseGPT can fully sparsify the 175-billion-parameter models in approximately 4 hours. For calibration data, we follow Frantar et al. (2022a) and use 128 2048-token segments, randomly chosen from the first shard of the C4 (Raffel et al., 2020) dataset. We choose blocksize 128 which lies in that range while also slightly simplifying the algorithm implementation as it matches the default lazy weight update batchsize. |
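
For orientation, here is a heavily simplified sketch of the column-wise procedure that Algorithm 1 describes: the inverse Hessian is formed once per layer via Cholesky factorization, a pruning mask is chosen per block of columns from the OBS saliency w^2 / [H^-1]_jj, and the error of each pruned column is spread over the columns not yet processed; a single block width stands in for both B and B_s, matching the paper's choice of 128. This is not the reference implementation (it omits n:m semi-structured sparsity, joint quantization, and the memory optimizations), and the function name, damping convention, and dense PyTorch formulation are assumptions.

```python
import torch

def sparsegpt_prune(W, H, sparsity=0.5, blocksize=128, percdamp=0.01):
    """Prune W (rows x cols) to the given unstructured sparsity with a
    SparseGPT-style column-wise OBS update. H is the cols x cols Hessian
    proxy X X^T accumulated over the calibration inputs."""
    W = W.clone().float()
    H = H.clone().float()
    cols = W.shape[1]

    # Dampen H, invert it, and keep the upper Cholesky factor of H^-1;
    # the squared diagonal of this factor gives the [H_F^-1]_jj terms
    # for the not-yet-pruned columns F.
    H += percdamp * torch.diag(H).mean() * torch.eye(cols, device=W.device)
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))   # H^-1
    Hinv = torch.linalg.cholesky(Hinv, upper=True)            # upper factor of H^-1

    for start in range(0, cols, blocksize):
        end = min(start + blocksize, cols)
        W_blk = W[:, start:end].clone()
        Err = torch.zeros_like(W_blk)
        Hinv_blk = Hinv[start:end, start:end]

        # Adaptive mask selection: prune the entries with the smallest
        # saliency w^2 / [H_F^-1]_jj within this block of columns.
        scores = W_blk ** 2 / torch.diag(Hinv_blk).reshape(1, -1) ** 2
        thresh = torch.sort(scores.flatten())[0][int(scores.numel() * sparsity)]
        mask = scores <= thresh

        # Prune column by column, spreading each column's error over the
        # remaining columns of the block.
        for j in range(end - start):
            w = W_blk[:, j]
            q = w.clone()
            q[mask[:, j]] = 0
            err = (w - q) / Hinv_blk[j, j]
            W_blk[:, j] = q
            W_blk[:, j + 1:] -= err.unsqueeze(1) * Hinv_blk[j, j + 1:].unsqueeze(0)
            Err[:, j] = err

        W[:, start:end] = W_blk
        # Lazy batch update: push the accumulated error to all later columns.
        W[:, end:] -= Err @ Hinv[start:end, end:]

    return W
```

Applied to a linear layer, W would be its weight matrix and H the sum of x xᵀ over the 128 calibration segments propagated through the model up to that layer.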
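The calibration setup (128 random segments of 2048 tokens from the first shard of the C4 training set) can be approximated with the Hugging Face `datasets` and `transformers` libraries. The sketch below is not the authors' loader: the C4 shard file name, the `facebook/opt-125m` tokenizer, and the rejection of documents shorter than one window are illustrative assumptions.

```python
import random

import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def get_c4_calibration(nsamples=128, seqlen=2048, seed=0,
                       model_name="facebook/opt-125m"):
    """Sample `nsamples` windows of `seqlen` tokens from the first C4
    training shard (shard file name assumed, not taken from the paper)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    data = load_dataset(
        "allenai/c4",
        data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
        split="train",
    )
    random.seed(seed)
    samples = []
    while len(samples) < nsamples:
        doc = data[random.randint(0, len(data) - 1)]
        ids = tokenizer(doc["text"], return_tensors="pt").input_ids
        if ids.shape[1] <= seqlen:
            continue  # skip documents too short for a full window
        start = random.randint(0, ids.shape[1] - seqlen - 1)
        samples.append(ids[:, start:start + seqlen])
    return torch.cat(samples, dim=0)  # (nsamples, seqlen) token ids
```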
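Perplexity on the held-out sets listed above is conventionally measured over non-overlapping 2048-token windows of the concatenated test text. A minimal sketch for raw WikiText2 follows, under the assumptions of a placeholder model name, a simple `\n\n` join of the test documents, and no device placement.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def wikitext2_perplexity(model_name="facebook/opt-125m", seqlen=2048):
    """Perplexity over non-overlapping 2048-token windows of raw WikiText2
    (the protocol common in the LLM compression literature). The model
    name is a placeholder, not one of the paper's large models."""
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

    nlls = []
    for i in range(ids.shape[1] // seqlen):
        window = ids[:, i * seqlen:(i + 1) * seqlen]
        loss = model(window, labels=window).loss  # mean next-token NLL
        nlls.append(loss * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen)).item()
```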