QuIP: 2-Bit Quantization of Large Language Models With Guarantees
Authors: Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher M. De Sa
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we find that our incoherence preprocessing improves several existing quantization algorithms and yields the first LLM quantization methods that produce viable results using only two bits per weight. Our code can be found at https://github.com/Cornell-RelaxML/QuIP. |
| Researcher Affiliation | Academia | Jerry Chee, Cornell University, jerrychee@cs.cornell.edu; Yaohui Cai, Cornell University, yc2632@cornell.edu; Volodymyr Kuleshov, Cornell University, kuleshov@cornell.edu; Christopher De Sa, Cornell University, cdesa@cs.cornell.edu |
| Pseudocode | Yes | Algorithm 1 QuIP Incoherence Pre-Processing; Algorithm 2 QuIP Incoherence Post-Processing; Algorithm 3 QuIP: Quantization with Incoherence Processing |
| Open Source Code | Yes | Our code can be found at https://github.com/Cornell-RelaxML/QuIP. |
| Open Datasets | Yes | Our calibration set is the same as OPTQ; 128 random 2048 token segments from the C4 dataset [25] consisting of generic text data from crawled websites. |
| Dataset Splits | No | The paper mentions a 'calibration set' used for quantization, but does not describe traditional train/validation/test dataset splits for model performance evaluation or specific percentages/counts for such splits. |
| Hardware Specification | Yes | We run experiments on a university cluster managed by a Slurm workload manager which has GPUs with up to 48GB of memory, though larger GPUs are only required for some methods on larger model sizes. [...] Average per-token throughput (batch size 1) when generating sequences of length 128 with OPT-66B on an A6000 GPU. |
| Software Dependencies | No | The experimental infrastructure is built on top of OPTQ's [8] repository, which is implemented in PyTorch [23]. The paper mentions PyTorch but does not specify its version number. |
| Experiment Setup | Yes | Our calibration set is the same as OPTQ; 128 random 2048 token segments from the C4 dataset [25] consisting of generic text data from crawled websites. [...] For the incoherence-based quantization range, we tune the parameter ρ and find that a value of 2.4 works well across all model sizes and quantization methods. We use this value for all our experiments. [...] When Greedy updates are used, we perform 10 passes over the weights in the same order as LDLQ and OPTQ, except for 5 passes on OPT-30b and OPT-66b. |
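
For context on the algorithms listed under Pseudocode, the sketch below illustrates the general idea of incoherence pre- and post-processing: conjugate the weight matrix and its Hessian proxy by random orthogonal matrices before quantization, then undo the rotation afterwards. This is a minimal illustration only; the paper's Algorithms 1 and 2 use efficient Kronecker-structured random orthogonal matrices and other details not shown here.

```python
import torch

def random_orthogonal(n, device, dtype):
    # QR of a Gaussian matrix gives a random orthogonal matrix; sign-fixing
    # the columns makes the distribution uniform over the orthogonal group.
    q, r = torch.linalg.qr(torch.randn(n, n, device=device, dtype=dtype))
    return q * torch.sign(torch.diagonal(r))

def incoherence_preprocess(W, H):
    # W: (m, n) layer weights; H: (n, n) proxy Hessian from calibration data.
    m, n = W.shape
    U = random_orthogonal(m, W.device, W.dtype)
    V = random_orthogonal(n, W.device, W.dtype)
    W_tilde = U @ W @ V.T    # rotated ("incoherent") weights
    H_tilde = V @ H @ V.T    # Hessian rotated consistently with the weights
    return W_tilde, H_tilde, U, V

def incoherence_postprocess(W_hat_tilde, U, V):
    # Undo the rotation after quantizing the rotated weights.
    return U.T @ W_hat_tilde @ V
```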
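
The ρ = 2.4 setting under Experiment Setup controls the incoherence-based quantization range. The sketch below assumes the clip range is ρ·‖W‖_F/√(mn), i.e. ρ times the RMS entry magnitude, consistent with the incoherence bound the paper builds on, and applies a plain b-bit uniform quantizer to that range; treat the exact range formula as an assumption and consult the paper for the precise definition.

```python
import torch

def quantize_uniform(W, bits=2, rho=2.4):
    # Assumed clip range: rho * ||W||_F / sqrt(mn) (rho times the RMS entry).
    m, n = W.shape
    half_range = rho * W.norm() / (m * n) ** 0.5
    levels = 2 ** bits
    scale = 2 * half_range / (levels - 1)
    # Round to the nearest of 2^bits evenly spaced levels within the range.
    q = torch.clamp(torch.round((W + half_range) / scale), 0, levels - 1)
    return q * scale - half_range   # dequantized weights
```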
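
The calibration set described under Open Datasets and Experiment Setup (128 random 2048-token segments from C4) could be assembled roughly as below. The dataset and tokenizer identifiers (`allenai/c4`, `facebook/opt-125m`) and the document-sampling strategy are assumptions for illustration, not taken from the paper or the OPTQ repository.

```python
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def get_calibration_set(n_samples=128, seq_len=2048, seed=0):
    random.seed(seed)
    tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
    data = load_dataset("allenai/c4", "en", split="train", streaming=True)
    samples, it = [], iter(data)
    while len(samples) < n_samples:
        ids = tok(next(it)["text"], return_tensors="pt").input_ids
        if ids.shape[1] <= seq_len:
            continue  # skip documents too short for a full window
        start = random.randint(0, ids.shape[1] - seq_len - 1)
        samples.append(ids[:, start:start + seq_len])
    return torch.cat(samples, dim=0)  # (n_samples, seq_len) token ids
```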
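
The throughput figure mentioned under Hardware Specification (average per-token throughput at batch size 1 when generating 128 tokens) can be measured along the lines of the sketch below; the model name, prompt, and use of `generate` are placeholders, not the paper's actual benchmarking harness.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def per_token_throughput(model_name="facebook/opt-125m", new_tokens=128):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).cuda().eval()
    ids = tok("The quick brown fox", return_tensors="pt").input_ids.cuda()
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        # Force exactly `new_tokens` generated tokens at batch size 1.
        model.generate(ids, max_new_tokens=new_tokens, min_new_tokens=new_tokens)
    torch.cuda.synchronize()
    return new_tokens / (time.time() - start)   # tokens per second
```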