QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks

Authors: Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, Christopher De Sa

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference.
Researcher Affiliation | Academia | 1 Department of Computer Science, Cornell University; 2 Department of Operations Research and Information Engineering, Cornell University.
Pseudocode | Yes | Algorithm 1: QuIP# without Fine-Tuning (QuIP#-NoFT), with inputs weight W ∈ ℝ^{m×n}, Hessian H ∈ ℝ^{n×n}, and a g-dimensional k-bit codebook C; Algorithm 2: QuIP# Inference (for a Linear Layer); Algorithm 3: Incoherence Processing with RHT (IP-RHT); Algorithm 4: Incoherence Processing with RFFT (IP-RFFT); Algorithm 5: QuIP# with Fine-Tuning. (A sketch of the RHT incoherence step appears after this table.)
Open Source Code | Yes | Our code can be found at https://github.com/Cornell-RelaxML/quip-sharp.
Open Datasets | Yes | Hessian matrices H were generated with 6144 sequences of a model's native context length (2048 for Llama 1, 4096 for Llama 2) from the RedPajama 1T (Computer, 2023) dataset.
Dataset Splits | Yes | We train on a small development dataset of 256 sequences from RedPajama 1T and validate on 128 sequences.
Hardware Specification | Yes | All experiments were run on NVIDIA A100 GPUs, except for the timing numbers, which were measured on an NVIDIA RTX 4090.
Software Dependencies | No | The paper mentions software components such as the Flash Attention library, the Hugging Face library, and a custom CUDA kernel, but does not specify their version numbers for reproducibility.
Experiment Setup | Yes | For the within-transformer-block stage of fine-tuning, we use the Adam optimizer (Kingma & Ba, 2017), a learning rate of 5 × 10⁻⁵, a batch size of 8, and a sequence length equal to the model's native context length. We train on a small development dataset of 256 sequences from RedPajama 1T and validate on 128 sequences. We train for 5 epochs (160 steps) and keep the best model parameters based on the validation set. (A sketch of this recipe follows the table.)
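
The paper's Algorithms 3 and 4 perform incoherence processing by conjugating the weight and proxy Hessian with random orthogonal transforms. Below is a minimal sketch of the random Hadamard transform (RHT) variant, assuming power-of-two layer dimensions and using a dense Hadamard matrix for readability; the released implementation uses fast O(n log n) transforms and fused CUDA kernels, so this is illustrative rather than the authors' code.

```python
# Minimal sketch of incoherence processing with a random Hadamard transform,
# in the spirit of QuIP#'s IP-RHT. Assumes m and n are powers of two and uses
# a dense Hadamard matrix for clarity (not the paper's fast transform kernels).
import torch
from scipy.linalg import hadamard

def random_hadamard(dim: int, seed: int) -> torch.Tensor:
    """Normalized Hadamard matrix times a random +/-1 diagonal (an orthogonal matrix)."""
    g = torch.Generator().manual_seed(seed)
    signs = (torch.randint(0, 2, (dim,), generator=g) * 2 - 1).double()
    H = torch.from_numpy(hadamard(dim)).double() / dim ** 0.5
    return H * signs  # equivalent to H @ diag(signs)

def incoherence_process(W: torch.Tensor, Hess: torch.Tensor, seed: int = 0):
    """Rotate weight W (m x n) and proxy Hessian Hess (n x n) with random
    orthogonal transforms so the rotated weight entries become "incoherent"
    (approximately Gaussian), which suits lattice codebook quantization."""
    m, n = W.shape
    U = random_hadamard(m, seed)        # left rotation, m x m
    V = random_hadamard(n, seed + 1)    # right rotation, n x n
    W_tilde = U @ W.double() @ V.T      # rotated weight, quantized downstream
    H_tilde = V @ Hess.double() @ V.T   # Hessian conjugated by the same right rotation
    return W_tilde, H_tilde, U, V

# At inference, the rotations are undone around the quantized matrix Q ~= W_tilde:
#   y = W x  ~=  U.T @ (Q @ (V @ x))
# so only two fast Hadamard multiplies plus the quantized matmul are needed.
```

Because a random orthogonal rotation spreads each weight entry across all coordinates, the rotated matrix has no outlier entries, which is what makes the lattice codebook quantization effective.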
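
The fine-tuning recipe in the experiment setup row (Adam, learning rate 5 × 10⁻⁵, batch size 8, 5 epochs over 256 RedPajama sequences, best checkpoint chosen on a 128-sequence validation split) amounts to a short training loop. The names `block`, `train_loader`, `val_loader`, and `loss_fn` below are hypothetical placeholders, not the repository's API.

```python
# Sketch of the per-block fine-tuning recipe described in the experiment setup.
# `block`, `train_loader`, `val_loader`, and `loss_fn` are hypothetical placeholders.
import copy
import torch

def finetune_block(block, train_loader, val_loader, loss_fn, epochs: int = 5):
    """Tune one transformer block and keep the best checkpoint by validation loss."""
    opt = torch.optim.Adam(block.parameters(), lr=5e-5)
    best_val = float("inf")
    best_state = copy.deepcopy(block.state_dict())
    for _ in range(epochs):                 # 5 epochs of 32 steps = 160 steps at batch size 8
        block.train()
        for batch in train_loader:          # batches of 8 full-context sequences
            opt.zero_grad()
            loss_fn(block, batch).backward()
            opt.step()
        block.eval()
        with torch.no_grad():
            val = sum(loss_fn(block, b).item() for b in val_loader) / len(val_loader)
        if val < best_val:                  # retain the best parameters seen so far
            best_val = val
            best_state = copy.deepcopy(block.state_dict())
    block.load_state_dict(best_state)
    return block
```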