SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
Authors: Tim Dettmers, Ruslan A. Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTAL VALIDATION. Experimental setup. We focus on three main goals: 1) evaluating the most compact representation with which SpQR can replicate the performance of a 16-bit model within 1% perplexity, 2) controlling for the average number of bits per parameter across methods and comparing to round-to-nearest (RTN) and GPTQ baselines, 3) finding the best trade-off in terms of model size and performance. For these settings, we evaluate the full SpQR algorithm on publicly available LLMs. We focus on the LLaMA-{7, 13, 30, 65}B model family (Touvron et al., 2023) and the Falcon-{7, 40, 180}B model family (TII UAE, 2023a). |
| Researcher Affiliation | Collaboration | 1 University of Washington, 2 HSE University, 3 Yandex, 4 Skoltech, 5 IST Austria, 6 ETH Zurich, 7 Neural Magic |
| Pseudocode | Yes | Algorithm 1 SpQR quantization algorithm: the left snippet describes the full procedure, the right side contains subroutines for bilevel quantization and finding outliers. ... Algorithm 2 SpQR quantization algorithm: the left snippet describes the full procedure, the right side contains subroutines for min-max quantization, bilevel quantization and finding outliers. (A simplified sketch of bilevel quantization with outlier extraction follows this table.) |
| Open Source Code | No | We provide full configurations in Appendix C, as well as code which we plan to release publicly. |
| Open Datasets | Yes | We measure perplexity on the WikiText2 (Merity et al., 2016), Penn Treebank (Marcus et al., 1994) and C4 (Raffel et al., 2020) datasets. Secondly, we measure zero-shot accuracy on five tasks: WinoGrande (Sakaguchi et al., 2021), PiQA (Tata & Patel, 2003), HellaSwag, ARC-easy and ARC-challenge (Clark et al., 2018). We use the LM Evaluation Harness (Gao et al., 2021) with recommended parameters. ... We quantize LLaMA models using the RedPajama dataset and Falcon models on the RefinedWeb dataset (TII UAE, 2023b). (A generic perplexity-evaluation sketch follows this table.) |
| Dataset Splits | No | The paper mentions using 'calibration data' and 'test' data, and uses standard benchmarks, but does not explicitly provide percentages or counts for training, validation, and test splits used in their experimental setup. |
| Hardware Specification | Yes | Our implementation takes around 4.5 hours on the largest model size (65B) on an NVIDIA A100 (80 GB). Our memory-efficient implementation takes 12 hours on a small 24 GB GPU. ... For example, with 3.5 bits per parameter one can fit Llama-2-70B on a single V100 with 32 GB and have some space for the KV cache, which would be impossible for GPTQ quantization with the same accuracy without weight offloading. (The arithmetic behind this claim is sketched after this table.) |
| Software Dependencies | No | The paper mentions software like 'PyTorch', 'cuSPARSE', and 'Weights & Biases (Biewald, 2020)' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | The full configuration we use to compress the LLaMA-30B model near-losslessly in Table 1 has the following hyperparameters: bw = 4, bs = bz = 3, β1 = β2 = 16, τ = 0.1. This translates to the following command line arguments in our supplementary code: python main.py $MODEL custom --custom_data_path=$DATA \ --wbits 4 --groupsize 16 --perchannel --qq_scale_bits 3 \ --qq_zero_bits 3 --qq_groupsize 16 --outlier_threshold 0.1 \ --fit_quantizer_without_outliers --permutation_order act_order |
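
To make the pseudocode row more concrete, here is a minimal, self-contained sketch of the two ingredients Algorithm 1 combines: bilevel (grouped) quantization, in which the per-group scales and zero points are themselves quantized in small groups, and the extraction of outlier weights that are stored separately in higher precision. This is not the authors' released implementation: the magnitude-based outlier rule, the group size, and the bit-widths below are simplified placeholders for the paper's sensitivity-based criterion and its β1/β2, bs/bz, and τ hyperparameters (the paper's τ = 0.1 is a sensitivity threshold, not a weight-magnitude threshold).

```python
import numpy as np

def quantize_minmax(x, bits):
    """Asymmetric min-max quantization of a 1-D array: returns integer codes, scale, zero point."""
    lo, hi = float(x.min()), float(x.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.clip(np.round((x - lo) / scale), 0, levels)
    return codes, scale, lo

def spqr_like_quantize_row(w, group_size=16, w_bits=4, stat_bits=3, tau=3.0):
    """Sketch of quantizing one weight row:
    1) pull out 'outliers' (here: weights more than tau standard deviations from zero,
       a crude stand-in for the paper's sensitivity-based criterion) and keep them in full precision;
    2) quantize the remaining weights in groups of `group_size`;
    3) quantize the per-group scales and zero points themselves (the bilevel step)."""
    outlier_mask = np.abs(w) > tau * np.std(w)
    outliers = np.where(outlier_mask, w, 0.0)   # kept as a sparse, high-precision matrix
    dense = np.where(outlier_mask, 0.0, w)

    codes, scales, zeros = [], [], []
    for g in range(0, len(dense), group_size):
        c, s, z = quantize_minmax(dense[g:g + group_size], w_bits)
        codes.append(c); scales.append(s); zeros.append(z)

    # Second level: the first-level statistics are quantized too.
    scale_stats = quantize_minmax(np.asarray(scales), stat_bits)
    zero_stats = quantize_minmax(np.asarray(zeros), stat_bits)
    return codes, scale_stats, zero_stats, outliers

row = np.random.randn(128).astype(np.float32)
codes, scale_stats, zero_stats, outliers = spqr_like_quantize_row(row)
print(f"{int((outliers != 0).sum())} outliers kept in high precision out of {row.size} weights")
```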
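The datasets row cites WikiText2, Penn Treebank and C4 perplexity. The reported numbers come from the authors' own evaluation code, but the standard chunked-perplexity protocol they follow is straightforward to reproduce. Below is a minimal sketch using Hugging Face `transformers` and `datasets`; the model name and the 2048-token sequence length are illustrative assumptions, not necessarily the paper's exact settings.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"   # illustrative; any causal LM checkpoint works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Concatenate the WikiText2 test split and score it in fixed-length chunks.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids
seqlen = 2048
nlls = []
with torch.no_grad():
    for i in range(0, ids.shape[1] - seqlen, seqlen):
        chunk = ids[:, i:i + seqlen].to(model.device)
        loss = model(chunk, labels=chunk).loss   # mean negative log-likelihood per token
        nlls.append(loss.float() * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"perplexity: {ppl.item():.2f}")
```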
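The hardware row's claim that a 3.5-bit Llama-2-70B fits on a 32 GB V100 follows from simple arithmetic: 70e9 parameters × 3.5 bits / 8 ≈ 30.6 GB of weight storage, versus roughly 140 GB at 16 bits. A tiny sketch of that calculation is below; the model sizes are nominal parameter counts, and real deployments also need memory for activations and the KV cache.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight-storage footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

for name, n_params in [("LLaMA-65B", 65e9), ("Llama-2-70B", 70e9)]:
    for bits in (16, 4.0, 3.5):
        print(f"{name:>12} @ {bits:>4} bits/param ≈ {weight_memory_gb(n_params, bits):6.1f} GB")
```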