SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot

Authors: Elias Frantar, Dan Alistarh

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. Our experiments, from which we provide a snapshot in Figures 1 and 2, lead to the following observations.
Researcher Affiliation | Collaboration | (1) Institute of Science and Technology Austria (ISTA); (2) Neural Magic Inc.
Pseudocode | Yes | Algorithm 1: The SparseGPT algorithm. We prune the layer matrix W to p% unstructured sparsity given the inverse Hessian H^{-1} = (XX^T + λI)^{-1}, lazy batch-update blocksize B, and adaptive mask selection blocksize Bs; each group of Bs consecutive columns will be p% sparse. (A hedged code sketch of this procedure is given after the table.)
Open Source Code | Yes | The code is available at: https://github.com/IST-DASLab/sparsegpt.
Open Datasets | Yes | For calibration data, we follow Frantar et al. (2022a) and use 128 2048-token segments, randomly chosen from the first shard of the C4 (Raffel et al., 2020) dataset. We consider the test sets of raw-WikiText2 (Merity et al., 2016) and PTB (Marcus et al., 1994) as well as a subset of the C4 validation data, all popular benchmarks in the LLM compression literature. (A calibration-data loading sketch is given after the table.)
Dataset Splits | Yes | For calibration data, we follow Frantar et al. (2022a) and use 128 2048-token segments, randomly chosen from the first shard of the C4 (Raffel et al., 2020) dataset. We consider the test sets of raw-WikiText2 (Merity et al., 2016) and PTB (Marcus et al., 1994) as well as a subset of the C4 validation data, all popular benchmarks in the LLM compression literature.
Hardware Specification | Yes | All pruning experiments are conducted on a single NVIDIA A100 GPU with 80GB of memory.
Software Dependencies | No | The paper mentions implementing SparseGPT in PyTorch and using the Hugging Face Transformers library, but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | All pruning experiments are conducted on a single NVIDIA A100 GPU with 80GB of memory. In this setup, SparseGPT can fully sparsify the 175-billion-parameter models in approximately 4 hours. For calibration data, we follow Frantar et al. (2022a) and use 128 2048-token segments, randomly chosen from the first shard of the C4 (Raffel et al., 2020) dataset. We choose blocksize 128, which lies in that range while also slightly simplifying the algorithm implementation, as it matches the default lazy weight-update batchsize.
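To make the procedure quoted in the Pseudocode row concrete, the following is a minimal PyTorch sketch of the column-blocked loop it describes: every Bs columns a mask is selected by the saliency w^2 / [H^{-1}]_cc^2, pruned weights are zeroed, and the resulting error is propagated to the not-yet-processed columns, with a lazy batch update every B columns. This is a simplified illustration rather than the authors' released implementation; the function name, the damping constant lam, and the quantile-based mask selection are assumptions made for readability.

```python
# Minimal sketch of the column-blocked pruning loop described in Algorithm 1.
# Names (sparsegpt_prune, lam) and the quantile-based mask selection are
# illustrative assumptions, not the authors' implementation.
import torch

def sparsegpt_prune(W, X, sparsity=0.5, B=128, Bs=128, lam=0.01):
    """Prune W (d_row x d_col) to `sparsity` unstructured sparsity.

    X has shape (d_col, n_tokens), so H = X @ X.T is the layer Hessian proxy.
    B is the lazy batch-update blocksize, Bs the mask-selection blocksize.
    """
    d_row, d_col = W.shape
    W = W.clone().float()

    # H^-1 = (X X^T + lam * I)^-1; the upper Cholesky factor of H^-1 supplies
    # the per-column inverse-Hessian information used below.
    H = X @ X.T + lam * torch.eye(d_col)
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    Hinv = torch.linalg.cholesky(Hinv, upper=True)

    M = torch.ones_like(W, dtype=torch.bool)  # True = weight is kept

    for i in range(0, d_col, B):
        j_end = min(i + B, d_col)
        E = torch.zeros(d_row, j_end - i)  # accumulated errors for this block

        for j in range(i, j_end):
            if (j - i) % Bs == 0:
                # Adaptive mask selection: over the next Bs columns, keep
                # roughly the (1 - sparsity) fraction of weights with the
                # largest w^2 / [H^-1]_cc^2 scores.
                cols = slice(j, min(j + Bs, j_end))
                scores = W[:, cols] ** 2 / torch.diag(Hinv)[cols] ** 2
                thresh = torch.quantile(scores.flatten(), sparsity)
                M[:, cols] = scores > thresh

            # Zero the pruned weights of column j and propagate the error to
            # the remaining columns of the current block (OBS-style update).
            err = (W[:, j] / Hinv[j, j]) * (~M[:, j]).float()
            W[:, j:j_end] -= err.unsqueeze(1) * Hinv[j, j:j_end].unsqueeze(0)
            E[:, j - i] = err

        # Lazy batch update: propagate the block's errors to all later columns.
        W[:, j_end:] -= E @ Hinv[i:j_end, j_end:]

    return W * M.float()
```

With B = Bs = 128, mask selection happens exactly once per lazy block, which is the simplification referred to in the Experiment Setup row (the mask-selection blocksize matches the default lazy weight-update batchsize).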
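The calibration setup in the Open Datasets and Experiment Setup rows (128 random 2048-token segments, i.e. 128 × 2048 = 262,144 calibration tokens, drawn from the first C4 training shard) could be reproduced along the following lines. This is a sketch, assuming the `allenai/c4` dataset on the Hugging Face Hub, its standard shard file naming, and a generic Transformers tokenizer; it is not the authors' exact data-loading code.

```python
# Sketch of the calibration-data selection: 128 random 2048-token segments
# from the first shard of the C4 training set. The Hub dataset/file names and
# the tokenizer choice are assumptions for illustration.
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def get_c4_calibration(nsamples=128, seqlen=2048, model_id="facebook/opt-125m", seed=0):
    data = load_dataset(
        "allenai/c4",
        data_files={"train": "en/c4-train.00000-of-01024.json.gz"},  # first shard (assumed naming)
        split="train",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)

    random.seed(seed)
    samples = []
    while len(samples) < nsamples:
        # Draw random documents until one is long enough, then take a random
        # seqlen-token window from it.
        text = data[random.randint(0, len(data) - 1)]["text"]
        ids = tokenizer(text, return_tensors="pt").input_ids
        if ids.shape[1] <= seqlen:
            continue
        start = random.randint(0, ids.shape[1] - seqlen - 1)
        samples.append(ids[:, start:start + seqlen])
    return torch.cat(samples, dim=0)  # (nsamples, seqlen) token ids
```

In a full pipeline, these segments would be run through the model layer by layer to collect each layer's inputs X, from which the Hessian XX^T used in the pruning sketch above is accumulated.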