SliceGPT: Compress Large Language Models by Deleting Rows and Columns
Authors: Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experimentation we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for LLAMA-2 70B, OPT 66B and Phi-2 models while maintaining 99%, 99% and 90% zero-shot task performance of the dense model respectively. Our sliced models run on fewer GPUs and run faster without any additional code optimization: on 24GB consumer GPUs we reduce the total compute for inference on LLAMA-2 70B to 64% of that of the dense model; on 40GB A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance in transformer networks, which enables SliceGPT and we hope it will inspire and enable future avenues to reduce memory and computation demands for pre-trained models. Code is available at: https://github.com/microsoft/TransformerCompression . |
| Researcher Affiliation | Collaboration | Saleh Ashkboos (ETH Zurich), Maximilian L. Croci (Microsoft Research), Marcelo Gennari do Nascimento (Microsoft), Torsten Hoefler (ETH Zurich), James Hensman (Microsoft Research) |
| Pseudocode | Yes | Algorithm 1 The forward pass of a transformer network |
| Open Source Code | Yes | Code is available at: https://github.com/microsoft/TransformerCompression . |
| Open Datasets | Yes | We experiment with two different calibration sets: the WikiText-2 training dataset (Merity et al., 2016) and the Alpaca training dataset (Taori et al., 2023). |
| Dataset Splits | No | The paper uses pre-trained models and evaluates on zero-shot tasks using the LM Evaluation Harness, but does not explicitly state training/validation/test dataset splits with percentages or sample counts for its own experiments beyond calibration set sizes. |
| Hardware Specification | Yes | The computation of Q is performed on a single H100 GPU with 80GB of memory, taking approximately 3.5 hours to complete for the LLAMA-2 70B model. We use double precision for the PCA calculation because using single precision for eigenvector calculations in PyTorch leads to a discrepancy in the final accuracy, as detailed in Appendix A.2. We evaluate all our experiments on the OPT (Zhang et al., 2022) and LLAMA-2 (Touvron et al., 2023) model families, and additionally evaluate Phi-2 in our zero-shot task experiments. We exclude OPT 175B, as it is outperformed by smaller LLAMA-2 models. Nonetheless, we anticipate that this larger model will yield improved results, as larger models typically offer more promising opportunities for compression (see Section 4.1). We evaluate our scheme on both language generation as well as popular zero-shot tasks. To demonstrate the comprehensive speedup achieved by SliceGPT we use: Quadro RTX6000 GPUs with 24GB of memory as a representative example of consumer-level GPUs; 40GB A100s and 80GB H100s to provide datacenter-level benchmarks. |
| Software Dependencies | Yes | We use the cuSPARSELt 0.5 library to run sparse matrix multiplications on an 80 GB A100 GPU |
| Experiment Setup | Yes | We apply a small amount of RFT to sliced LLAMA-2 and Phi-2 models using LoRA (Hu et al., 2021), following the idea from Ma et al. (2023a). For models sliced with WikiText-2 we use approximately 1k sequences, for those sliced with the Alpaca dataset we use 5k. We use LoRA with r = 32, α = 10 and sequence length 1024, and defaults for all other hyperparameters in PEFT (Mangrulkar et al., 2022). |
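The computational-invariance insight quoted above (an orthogonal rotation Q, computed by PCA on calibration activations in double precision, can be inserted into the network without changing its output, after which low-variance directions are deleted) can be illustrated with a minimal numpy sketch. All dimensions, data, and variable names here are hypothetical; this is not the authors' implementation, only the core linear-algebra idea:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, keep = 8, 100, 6           # hidden dim, calibration samples, dims kept

X = rng.normal(size=(n, d))      # hypothetical calibration activations
W = rng.normal(size=(d, d))      # a weight matrix consuming those activations

# PCA via eigendecomposition of the activation covariance, in double
# precision as the paper recommends for eigenvector stability.
C = (X.T @ X).astype(np.float64)
eigvals, Q = np.linalg.eigh(C)
Q = Q[:, ::-1]                   # sort eigenvectors by descending variance

# Computational invariance: inserting Q Q^T (Q orthogonal) changes nothing.
out_dense = X @ W
out_rotated = (X @ Q) @ (Q.T @ W)
assert np.allclose(out_dense, out_rotated)

# Slicing: keep only the top principal directions, shrinking both matrices
# (deleting rows of one and columns of the other) at a small approximation cost.
X_small = X @ Q[:, :keep]        # shape (n, keep)
W_small = Q[:, :keep].T @ W      # shape (keep, d)
out_sliced = X_small @ W_small   # approximates out_dense
```

Deleting the low-variance directions is what turns an exact rotation into a compression: the matrices become physically smaller, so no sparse kernels are needed at inference time.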
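The recovery fine-tuning (RFT) row quotes LoRA with r = 32 and α = 10. The paper uses the PEFT library for this; as a library-free sketch, the LoRA update itself is just a low-rank correction y = Wx + (α/r)·BAx added to a frozen base weight. Everything below is a hypothetical toy (dimensions, init, names), not the authors' training code:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, r, alpha = 64, 64, 32, 10    # r = 32, alpha = 10 as in the paper

W = rng.normal(size=(d_out, d_in))        # frozen (sliced) base weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-init

def lora_forward(x):
    # y = W x + (alpha / r) * B A x ; only A and B receive gradients in RFT
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B initialized to zero, LoRA starts as an exact identity on the base model
assert np.allclose(lora_forward(x), W @ x)
```

The zero-initialized B matrix means RFT starts from the sliced model's behavior and learns only a small correction, which is why a few thousand calibration-set sequences suffice.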