SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
Authors: Tim Dettmers, Ruslan A. Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTAL VALIDATION. Experimental setup. We focus on three main goals: 1) evaluating the most compact representation with which SpQR can replicate the performance of a 16-bit model within 1% perplexity, 2) controlling for the average number of bits per parameter across methods and comparing to round-to-nearest (RTN) and GPTQ baselines, 3) finding the best trade-off in terms of model size and performance. For these settings, we evaluate the full SpQR algorithm on publicly available LLMs. We focus on the LLaMA-{7, 13, 30, 65}B model family (Touvron et al., 2023) and the Falcon-{7, 40, 180}B model family (TII UAE, 2023a). |
| Researcher Affiliation | Collaboration | 1 University of Washington, 2 HSE University, 3 Yandex, 4 Skoltech, 5 IST Austria, 6 ETH Zurich, 7 Neural Magic |
| Pseudocode | Yes | Algorithm 1 SpQR quantization algorithm: the left snippet describes the full procedure, the right side contains subroutines for bilevel quantization and finding outliers. ... Algorithm 2 SpQR quantization algorithm: the left snippet describes the full procedure, the right side contains subroutines for min-max quantization, bilevel quantization and finding outliers. (A simplified sketch of bilevel quantization with outlier extraction follows this table.) |
| Open Source Code | No | We provide full configurations in Appendix C, as well as code which we plan to release publicly. |
| Open Datasets | Yes | We measure perplexity on the WikiText2 (Merity et al., 2016), Penn Treebank (Marcus et al., 1994) and C4 (Raffel et al., 2020) datasets. Secondly, we measure zero-shot accuracy on five tasks: WinoGrande (Sakaguchi et al., 2021), PiQA (Tata & Patel, 2003), HellaSwag, ARC-easy and ARC-challenge (Clark et al., 2018). We use the LM Evaluation Harness (Gao et al., 2021) with recommended parameters. ... We quantize LLaMA models using the RedPajama dataset and Falcon models on the RefinedWeb dataset (TII UAE, 2023b). (A generic perplexity-evaluation sketch follows this table.) |
| Dataset Splits | No | The paper mentions using 'calibration data' and 'test' data, and uses standard benchmarks, but does not explicitly provide percentages or counts for training, validation, and test splits used in their experimental setup. |
| Hardware Specification | Yes | Our implementation takes around 4.5 hours on the largest model size (65B) on an NVIDIA A100 (80 GB). Our memory-efficient implementation takes 12 hours on a small 24 GB GPU. ... For example, with 3.5 bits per parameter one can fit Llama-2-70B on a single V100 with 32 GB and have some space for the KV cache, which would be impossible for GPTQ quantization with the same accuracy without weight offloading. (The arithmetic behind this claim is sketched after this table.) |
| Software Dependencies | No | The paper mentions software like 'PyTorch', 'cuSPARSE', and 'Weights & Biases (Biewald, 2020)' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | The full configuration we use to compress the LLaMA-30B model near-losslessly in Table 1 has the following hyperparameters: bw = 4, bs = bz = 3, β1 = β2 = 16, τ = 0.1. This translates to the following command line arguments in our supplementary code: python main.py $MODEL custom --custom_data_path=$DATA \ --wbits 4 --groupsize 16 --perchannel --qq_scale_bits 3 \ --qq_zero_bits 3 --qq_groupsize 16 --outlier_threshold 0.1 \ --fit_quantizer_without_outliers --permutation_order act_order |
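
To make the pseudocode row more concrete, here is a minimal, self-contained sketch of the two ingredients Algorithm 1 combines: bilevel (grouped) quantization, in which the per-group scales and zero points are themselves quantized in small groups, and the extraction of outlier weights that are stored separately in higher precision. This is not the authors' released implementation: the magnitude-based outlier rule, the group size, and the bit-widths below are simplified placeholders for the paper's sensitivity-based criterion and its β1/β2, bs/bz, and τ hyperparameters (the paper's τ = 0.1 is a sensitivity threshold, not a weight-magnitude threshold).

```python
import numpy as np

def quantize_minmax(x, bits):
    """Asymmetric min-max quantization of a 1-D array: returns integer codes, scale, zero point."""
    lo, hi = float(x.min()), float(x.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.clip(np.round((x - lo) / scale), 0, levels)
    return codes, scale, lo

def spqr_like_quantize_row(w, group_size=16, w_bits=4, stat_bits=3, tau=3.0):
    """Sketch of quantizing one weight row:
    1) pull out 'outliers' (here: weights more than tau standard deviations from zero,
       a crude stand-in for the paper's sensitivity-based criterion) and keep them in full precision;
    2) quantize the remaining weights in groups of `group_size`;
    3) quantize the per-group scales and zero points themselves (the bilevel step)."""
    outlier_mask = np.abs(w) > tau * np.std(w)
    outliers = np.where(outlier_mask, w, 0.0)   # kept as a sparse, high-precision matrix
    dense = np.where(outlier_mask, 0.0, w)

    codes, scales, zeros = [], [], []
    for g in range(0, len(dense), group_size):
        c, s, z = quantize_minmax(dense[g:g + group_size], w_bits)
        codes.append(c); scales.append(s); zeros.append(z)

    # Second level: the first-level statistics are quantized too.
    scale_stats = quantize_minmax(np.asarray(scales), stat_bits)
    zero_stats = quantize_minmax(np.asarray(zeros), stat_bits)
    return codes, scale_stats, zero_stats, outliers

row = np.random.randn(128).astype(np.float32)
codes, scale_stats, zero_stats, outliers = spqr_like_quantize_row(row)
print(f"{int((outliers != 0).sum())} outliers kept in high precision out of {row.size} weights")
```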
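The datasets row cites WikiText2, Penn Treebank and C4 perplexity. The reported numbers come from the authors' own evaluation code, but the standard chunked-perplexity protocol they follow is straightforward to reproduce. Below is a minimal sketch using Hugging Face `transformers` and `datasets`; the model name and the 2048-token sequence length are illustrative assumptions, not necessarily the paper's exact settings.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"   # illustrative; any causal LM checkpoint works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Concatenate the WikiText2 test split and score it in fixed-length chunks.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids
seqlen = 2048
nlls = []
with torch.no_grad():
    for i in range(0, ids.shape[1] - seqlen, seqlen):
        chunk = ids[:, i:i + seqlen].to(model.device)
        loss = model(chunk, labels=chunk).loss   # mean negative log-likelihood per token
        nlls.append(loss.float() * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"perplexity: {ppl.item():.2f}")
```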
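The hardware row's claim that a 3.5-bit Llama-2-70B fits on a 32 GB V100 follows from simple arithmetic: 70e9 parameters × 3.5 bits / 8 ≈ 30.6 GB of weight storage, versus roughly 140 GB at 16 bits. A tiny sketch of that calculation is below; the model sizes are nominal parameter counts, and real deployments also need memory for activations and the KV cache.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight-storage footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

for name, n_params in [("LLaMA-65B", 65e9), ("Llama-2-70B", 70e9)]:
    for bits in (16, 4.0, 3.5):
        print(f"{name:>12} @ {bits:>4} bits/param ≈ {weight_memory_gb(n_params, bits):6.1f} GB")
```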