QuIP: 2-Bit Quantization of Large Language Models With Guarantees
Authors: Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher M. De Sa
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we find that our incoherence preprocessing improves several existing quantization algorithms and yields the first LLM quantization methods that produce viable results using only two bits per weight. Our code can be found at https://github.com/Cornell-RelaxML/QuIP. |
| Researcher Affiliation | Academia | Jerry Chee, Cornell University, jerrychee@cs.cornell.edu; Yaohui Cai, Cornell University, yc2632@cornell.edu; Volodymyr Kuleshov, Cornell University, kuleshov@cornell.edu; Christopher De Sa, Cornell University, cdesa@cs.cornell.edu |
| Pseudocode | Yes | Algorithm 1 QuIP Incoherence Pre-Processing; Algorithm 2 QuIP Incoherence Post-Processing; Algorithm 3 QuIP: Quantization with Incoherence Processing |
| Open Source Code | Yes | Our code can be found at https://github.com/Cornell-RelaxML/QuIP. |
| Open Datasets | Yes | Our calibration set is the same as OPTQ; 128 random 2048 token segments from the C4 dataset [25] consisting of generic text data from crawled websites. |
| Dataset Splits | No | The paper mentions a 'calibration set' used for quantization, but does not describe traditional train/validation/test dataset splits for model performance evaluation or specific percentages/counts for such splits. |
| Hardware Specification | Yes | We run experiments on a university cluster managed by a Slurm workload manager which has GPUs with up to 48GB of memory, though larger GPUs are only required for some methods on larger model sizes. [...] Average per-token throughput (batch size 1) when generating sequences of length 128 with OPT-66B on an A6000 GPU. |
| Software Dependencies | No | The experimental infrastructure is built on top of OPTQ's [8] repository, which is implemented in PyTorch [23]. The paper mentions PyTorch but does not specify its version number. |
| Experiment Setup | Yes | Our calibration set is the same as OPTQ; 128 random 2048 token segments from the C4 dataset [25] consisting of generic text data from crawled websites. [...] For the incoherence-based quantization range, we tune the parameter ρ and find that a value of 2.4 works well across all model sizes and quantization methods. We use this value for all our experiments. [...] When Greedy updates are used, we perform 10 passes over the weights in the same order as LDLQ and OPTQ, except for 5 passes on OPT-30b and OPT-66b. |
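
For context on the algorithms listed under Pseudocode, the sketch below illustrates the general idea of incoherence pre- and post-processing: conjugate the weight matrix and its Hessian proxy by random orthogonal matrices before quantization, then undo the rotation afterwards. This is a minimal illustration only; the paper's Algorithms 1 and 2 use efficient Kronecker-structured random orthogonal matrices and other details not shown here.

```python
import torch

def random_orthogonal(n, device, dtype):
    # QR of a Gaussian matrix gives a random orthogonal matrix; sign-fixing
    # the columns makes the distribution uniform over the orthogonal group.
    q, r = torch.linalg.qr(torch.randn(n, n, device=device, dtype=dtype))
    return q * torch.sign(torch.diagonal(r))

def incoherence_preprocess(W, H):
    # W: (m, n) layer weights; H: (n, n) proxy Hessian from calibration data.
    m, n = W.shape
    U = random_orthogonal(m, W.device, W.dtype)
    V = random_orthogonal(n, W.device, W.dtype)
    W_tilde = U @ W @ V.T    # rotated ("incoherent") weights
    H_tilde = V @ H @ V.T    # Hessian rotated consistently with the weights
    return W_tilde, H_tilde, U, V

def incoherence_postprocess(W_hat_tilde, U, V):
    # Undo the rotation after quantizing the rotated weights.
    return U.T @ W_hat_tilde @ V
```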
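
The ρ = 2.4 setting under Experiment Setup controls the incoherence-based quantization range. The sketch below assumes the clip range is ρ·‖W‖_F/√(mn), i.e. ρ times the RMS entry magnitude, consistent with the incoherence bound the paper builds on, and applies a plain b-bit uniform quantizer to that range; treat the exact range formula as an assumption and consult the paper for the precise definition.

```python
import torch

def quantize_uniform(W, bits=2, rho=2.4):
    # Assumed clip range: rho * ||W||_F / sqrt(mn) (rho times the RMS entry).
    m, n = W.shape
    half_range = rho * W.norm() / (m * n) ** 0.5
    levels = 2 ** bits
    scale = 2 * half_range / (levels - 1)
    # Round to the nearest of 2^bits evenly spaced levels within the range.
    q = torch.clamp(torch.round((W + half_range) / scale), 0, levels - 1)
    return q * scale - half_range   # dequantized weights
```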
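
The calibration set described under Open Datasets and Experiment Setup (128 random 2048-token segments from C4) could be assembled roughly as below. The dataset and tokenizer identifiers (`allenai/c4`, `facebook/opt-125m`) and the document-sampling strategy are assumptions for illustration, not taken from the paper or the OPTQ repository.

```python
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def get_calibration_set(n_samples=128, seq_len=2048, seed=0):
    random.seed(seed)
    tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
    data = load_dataset("allenai/c4", "en", split="train", streaming=True)
    samples, it = [], iter(data)
    while len(samples) < n_samples:
        ids = tok(next(it)["text"], return_tensors="pt").input_ids
        if ids.shape[1] <= seq_len:
            continue  # skip documents too short for a full window
        start = random.randint(0, ids.shape[1] - seq_len - 1)
        samples.append(ids[:, start:start + seq_len])
    return torch.cat(samples, dim=0)  # (n_samples, seq_len) token ids
```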
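
The throughput figure mentioned under Hardware Specification (average per-token throughput at batch size 1 when generating 128 tokens) can be measured along the lines of the sketch below; the model name, prompt, and use of `generate` are placeholders, not the paper's actual benchmarking harness.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def per_token_throughput(model_name="facebook/opt-125m", new_tokens=128):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).cuda().eval()
    ids = tok("The quick brown fox", return_tensors="pt").input_ids.cuda()
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        # Force exactly `new_tokens` generated tokens at batch size 1.
        model.generate(ids, max_new_tokens=new_tokens, min_new_tokens=new_tokens)
    torch.cuda.synchronize()
    return new_tokens / (time.time() - start)   # tokens per second
```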