QTIP: Quantization with Trellises and Incoherence Processing

Authors: Albert Tseng, Qingyao Sun, David Hou, Christopher M. De Sa

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental (4 experiments) | "Here, we present experiments quantizing the Llama family of models with QTIP [33, 34, 26]." and "Table 3: Wikitext2 and C4 perplexity (↓), ctx. 4096, QTIP with pure-computed codes."
Researcher Affiliation | Academia | Albert Tseng (Cornell University, albert@cs.cornell.edu); Qingyao Sun (Cornell University, qs234@cornell.edu); David Hou (dhou@alumni.caltech.edu); Christopher De Sa (Cornell University, cdesa@cs.cornell.edu)
Pseudocode | Yes | Algorithm 1: Computed Gaussian Code 1MAD; Algorithm 2: Computed Gaussian Code 3INST; Algorithm 3: Hybrid Computed-Lookup 2D Gaussian Code HYB; Algorithm 4: Tail-Biting Trellis Approx.; Algorithm 5: QTIP with Block LDLQ. (A minimal 1MAD-style sketch appears after this table.)
Open Source Code | Yes | "Our code is available at https://github.com/Cornell-RelaxML/qtip."
Open Datasets | Yes | "All sequences were sampled from the RedPajama dataset [7]." and "We use the OPTQ Wikitext2 and C4 test splits to calculate perplexity [14]." (A sketch of the ctx-4096 perplexity protocol follows this table.)
Dataset Splits | No | The paper mentions the Wikitext2 and C4 test splits and the data used for Hessian generation, but it does not explicitly provide the train/validation/test splits (percentages, sample counts, or citations to predefined splits) needed to reproduce the full training and validation process.
Hardware Specification | Yes | "Table 4: Batch size 1 decoding throughput on a RTX6000 Ada (960GB/s mem. BW)." and "Table 17: Decoding speed on different Ampere and Lovelace GPUs," which lists the RTX 3090, RTX A6000 (Ampere), and RTX 6000 Ada.
Software Dependencies | No | The paper mentions running on NVIDIA GPUs and discusses ALU instructions, but it does not specify software dependencies such as exact Python, PyTorch, or CUDA versions needed for reproducibility.
Experiment Setup | Yes | "Here, we use 1MAD and 3INST with L = 16, V = 1, Tx = Ty = 16." and "Here, we use the hybrid lookup-computed code with L = 16, V = 2, Tx = Ty = 16, Q = 9." and "To evaluate this, we fine-tune using QuIP#'s methodology, tuning both the codebook entries and the as-yet-unquantized weights in a blockwise fashion." (See the bitshift-trellis decode sketch after this table.)
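The following is a minimal sketch of the idea behind Algorithm 1's computed Gaussian code, referenced in the Pseudocode row above: scramble the trellis state with a linear congruential generator, then sum the bytes of the result so that the central limit theorem yields an approximately Gaussian output. The LCG constants and the normalization here are illustrative placeholders, not the values chosen in the paper.

```python
import math

MASK32 = (1 << 32) - 1

def lcg_step(x, a=0x9E3779B1, b=0x7F4A7C15):
    """One step of a 32-bit linear congruential generator. The multiplier
    and increment are illustrative placeholders, not the paper's constants."""
    return (a * x + b) & MASK32

def gaussian_1mad_style(state):
    """Map an integer trellis state to an approximately Gaussian value:
    scramble the state with an LCG, then sum the four bytes of the 32-bit
    result. The sum of four roughly uniform bytes is approximately normal
    by the central limit theorem; center and rescale to mean 0, variance 1."""
    x = lcg_step(state)
    byte_sum = sum((x >> (8 * i)) & 0xFF for i in range(4))
    mean = 4 * 255 / 2                        # E[sum of four Uniform{0..255}]
    std = math.sqrt(4 * (256 ** 2 - 1) / 12)  # stddev of that sum
    return (byte_sum - mean) / std
```

Because every entry is recomputed from its state on the fly, the full 2^L-entry codebook (65,536 entries at L = 16) never has to be stored or fetched from memory, which is the point of a "computed" code.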
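The Experiment Setup row cites the trellis parameters L = 16 (state bits) and V (values per step). As a rough illustration of how a bitshift trellis turns a bitstream into weights, here is a hedged sequential decode assuming V = 1 and k fresh bits consumed per step, reusing gaussian_1mad_style from the sketch above. The function name and the simple non-tail-biting initialization are this sketch's own simplifications: the paper's Algorithm 4 handles the initial window with a tail-biting approximation, and the real kernels decode Tx × Ty tiles in parallel.

```python
def decode_bitshift_trellis(bits, L=16, k=2, decode_state=gaussian_1mad_style):
    """Decode a bitshift-trellis code sequentially. The trellis state is an
    L-bit sliding window over the bitstream; each step shifts in k fresh
    bits, so consecutive states share L - k bits and the stream costs k
    bits per produced value. `bits` is a sequence of 0/1 ints."""
    mask = (1 << L) - 1
    state = 0
    for b in bits[:L]:                 # spend L bits on the first state
        state = ((state << 1) | b) & mask
    values = [decode_state(state)]
    for i in range(L, len(bits) - k + 1, k):
        for b in bits[i : i + k]:      # shift in the next k bits
            state = ((state << 1) | b) & mask
        values.append(decode_state(state))
    return values
```

Because neighboring states overlap in L - k bits, the encoder can search the 2^L-state trellis (e.g., with Viterbi) for a bit sequence whose reconstructed values best match the weights, while the decoder remains a pure bit shift plus the computed code.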
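The Open Datasets row notes that perplexity is computed on the OPTQ Wikitext2 and C4 test splits at context length 4096. Below is a hedged sketch of that protocol following the common OPTQ-style evaluation (concatenate the test split, score non-overlapping ctx-length windows); QTIP's actual evaluation script lives in the linked repository and may differ in details.

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

def wikitext2_ppl(model_name, ctx=4096, device="cuda"):
    """Perplexity over the concatenated Wikitext2 test split, scored in
    non-overlapping ctx-length windows (OPTQ-style protocol)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16
    ).to(device).eval()
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    ids = tok("\n\n".join(test["text"]), return_tensors="pt").input_ids
    n_windows = ids.shape[1] // ctx
    nll_sum = 0.0
    for i in range(n_windows):
        window = ids[:, i * ctx : (i + 1) * ctx].to(device)
        with torch.no_grad():
            # .loss is the mean negative log-likelihood per predicted token
            nll_sum += model(window, labels=window).loss.float().item() * ctx
    return math.exp(nll_sum / (n_windows * ctx))
```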