QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks
Authors: Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, Christopher De Sa
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference. |
| Researcher Affiliation | Academia | ¹Department of Computer Science, Cornell University; ²Department of Operations Research and Information Engineering, Cornell University. |
| Pseudocode | Yes | Algorithm 1 QuIP# without Fine-Tuning (QuIP#-NoFT) input Weight W ∈ ℝ^(m×n), Hessian H ∈ ℝ^(n×n), g-dim. k-bit codebook C ... Algorithm 2 QuIP# Inference (for a Linear Layer) ... Algorithm 3 Incoherence Processing with RHT (IP-RHT) ... Algorithm 4 Incoherence Processing with RFFT (IP-RFFT) ... Algorithm 5 QuIP# with Fine-Tuning (a hedged RHT sketch appears below the table) |
| Open Source Code | Yes | Our code can be found at https://github.com/Cornell-RelaxML/quip-sharp. |
| Open Datasets | Yes | Hessian matrices H were generated with 6144 sequences of a model's native context length (2048 for Llama 1, 4096 for Llama 2) from the RedPajama 1T (Computer, 2023) dataset. (See the Hessian-accumulation sketch below the table.) |
| Dataset Splits | Yes | We train on a small development dataset of 256 sequences from RedPajama 1T and validate on 128 sequences. |
| Hardware Specification | Yes | All experiments were run on NVIDIA A100 GPUs except for the timing numbers, which were measured on an NVIDIA RTX 4090. |
| Software Dependencies | No | The paper mentions software components like "Flash Attention library", "Hugging Face library", and "CUDA kernel" but does not specify their version numbers for reproducibility. |
| Experiment Setup | Yes | For the within-transformer block section of fine-tuning, we use the Adam optimizer (Kingma & Ba, 2017), a learning rate of 5 × 10⁻⁵, batch size of 8, and sequence length equal to the model's native context length. We train on a small development dataset of 256 sequences from RedPajama 1T and validate on 128 sequences. We train for 5 epochs (160 steps) and keep the best model parameters based on the validation set. (A sketch of this loop follows the table.) |
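To make the Pseudocode row more concrete, here is a minimal sketch of incoherence processing with a randomized Hadamard transform in the spirit of Algorithm 3 (IP-RHT). The function names, seeding, and power-of-two dimension assumption are illustrative choices, not the authors' implementation; the idea is to conjugate the weight and proxy Hessian by orthogonal matrices built from a Hadamard matrix and random signs (the paper's Algorithm 4 uses an RFFT variant for other sizes).

```python
# Hypothetical sketch of incoherence processing with a randomized Hadamard
# transform (RHT); not the QuIP# reference code.
import numpy as np
from scipy.linalg import hadamard

def rht_matrix(n, rng):
    """Return V = H_n @ diag(s) / sqrt(n) with random signs s (n must be a power of 2)."""
    assert n & (n - 1) == 0, "this simple sketch assumes power-of-two dimensions"
    s = rng.choice([-1.0, 1.0], size=n)
    return hadamard(n) * s / np.sqrt(n)   # orthogonal: V @ V.T = I

def incoherence_process(W, H, seed=0):
    """Rotate weight W (m x n) and proxy Hessian H (n x n) so their entries
    become incoherent (roughly Gaussian-looking) before quantization."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    U = rht_matrix(m, rng)        # left rotation applied to W
    V = rht_matrix(n, rng)        # right rotation for W, two-sided for H
    W_hat = U @ W @ V.T
    H_hat = V @ H @ V.T
    return W_hat, H_hat, U, V     # U, V (or their sign seeds) are needed to undo the rotation at inference
```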
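The Hessian matrices mentioned in the Open Datasets row are the usual per-layer proxy Hessians built from calibration activations. Below is a hedged sketch of how such an H could be accumulated with a forward hook; `run_model` and `calib_batches` are assumed placeholders, and the paper's 6144-sequence RedPajama setup is only referenced, not reproduced.

```python
# Hypothetical sketch: accumulate a proxy Hessian H ≈ E[x xᵀ] over the inputs
# of one linear layer while running calibration sequences through the model.
# `run_model` is an assumed callable that performs a full forward pass.
import torch

def accumulate_hessian(layer: torch.nn.Linear, run_model, calib_batches):
    n = layer.in_features
    H = torch.zeros(n, n, dtype=torch.float64)
    count = 0

    def hook(_module, inputs, _output):
        nonlocal count
        x = inputs[0].reshape(-1, n).to(torch.float64).cpu()  # flatten batch/sequence dims
        H.add_(x.T @ x)                                       # running sum of outer products
        count += x.shape[0]

    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            for batch in calib_batches:   # e.g. tokenized calibration sequences
                run_model(batch)
    finally:
        handle.remove()
    return H / max(count, 1)              # average outer product of the layer's inputs
```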
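Finally, the fine-tuning recipe in the Experiment Setup row maps onto a short training loop. The sketch below uses placeholder `block`, data loaders, and loss function; only the hyperparameters (Adam, 5 × 10⁻⁵ learning rate, 5 epochs, best-on-validation checkpointing) come from the paper.

```python
# Minimal sketch of the reported within-block fine-tuning recipe; the names
# `block`, `train_loader`, `val_loader`, and `loss_fn` are assumed placeholders.
import copy
import torch

def finetune_block(block, train_loader, val_loader, loss_fn, epochs=5, lr=5e-5):
    opt = torch.optim.Adam(block.parameters(), lr=lr)
    best_state, best_val = copy.deepcopy(block.state_dict()), float("inf")
    for _ in range(epochs):
        block.train()
        for x, target in train_loader:       # dev set (256 sequences in the paper), batch size 8
            opt.zero_grad()
            loss_fn(block(x), target).backward()
            opt.step()
        block.eval()
        with torch.no_grad():                # validation set (128 sequences in the paper)
            val = sum(loss_fn(block(x), t).item() for x, t in val_loader)
        if val < best_val:                   # keep the best parameters seen so far
            best_val, best_state = val, copy.deepcopy(block.state_dict())
    block.load_state_dict(best_state)
    return block
```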