QTIP: Quantization with Trellises and Incoherence Processing
Authors: Albert Tseng, Qingyao Sun, David Hou, Christopher M. De Sa
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4 (Experiments): Here, we present experiments quantizing the Llama family of models with QTIP [33, 34, 26]. and Table 3: Wikitext2 and C4 perplexity (↓), ctx. 4096, QTIP with pure-computed codes. |
| Researcher Affiliation | Academia | Albert Tseng Cornell University albert@cs.cornell.edu Qingyao Sun Cornell University qs234@cornell.edu David Hou dhou@alumni.caltech.edu Christopher De Sa Cornell University cdesa@cs.cornell.edu |
| Pseudocode | Yes | Algorithm 1 Computed Gaussian Code 1MAD, Algorithm 2 Computed Gaussian Code 3INST, Algorithm 3 Hybrid Computed-Lookup 2D Gaussian Code HYB, Algorithm 4 Tail-biting Trellis Approx. and Algorithm 5 QTIP with Block LDLQ. |
| Open Source Code | Yes | Our code is available at https://github.com/Cornell-RelaxML/qtip. |
| Open Datasets | Yes | All sequences were sampled from the RedPajama dataset [7]. and We use the OPTQ Wikitext2 and C4 test splits to calculate perplexity [14]. |
| Dataset Splits | No | The paper mentions using the Wikitext2 and C4 test splits and data for Hessian generation, but it does not explicitly provide the train/validation/test splits (percentages, sample counts, or citations to predefined splits) needed to reproduce the full training and validation process. |
| Hardware Specification | Yes | Table 4: Batch size 1 decoding throughput on an RTX 6000 Ada (960 GB/s mem. BW). and Table 17: Decoding speed on different Ampere and Lovelace GPUs, listing the RTX 3090, RTX A6000 (Ampere), and RTX 6000 Ada. |
| Software Dependencies | No | The paper mentions running on 'NVIDIA GPUs' and discussing 'ALU instructions', but it does not specify software dependencies like exact Python, PyTorch, or CUDA versions needed for reproducibility. |
| Experiment Setup | Yes | Here, we use 1MAD and 3INST with L = 16, V = 1, Tx = Ty = 16. and Here, we use the hybrid lookup-computed code with L = 16, V = 2, Tx = Ty = 16, Q = 9. and To evaluate this, we fine-tune using QuIP#'s methodology, tuning both the codebook entries and the as-yet-unquantized weights in a blockwise fashion. |
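The computed codes listed in the pseudocode row (1MAD in particular) admit a compact illustration: a single linear-congruential multiply-add scrambles the trellis state, and summing the bytes of the result yields an approximately Gaussian value by the central limit theorem. A minimal NumPy sketch in that spirit, using generic Numerical Recipes LCG constants as placeholders rather than the constants the paper actually tunes:

```python
import numpy as np

# Placeholder LCG constants (Numerical Recipes); the paper tunes its own.
A, C = 1664525, 1013904223

def one_mad(state: np.ndarray) -> np.ndarray:
    """Map 32-bit trellis states to roughly N(0, 1) values.

    One multiply-add (the LCG step) scrambles the state; summing the
    four bytes of the result approximates a Gaussian via the CLT.
    """
    mixed = (A * state.astype(np.uint64) + C) & 0xFFFFFFFF  # mod 2^32
    byte_sum = sum((mixed >> (8 * i)) & 0xFF for i in range(4))
    mean = 4 * 255 / 2.0                  # E[sum of 4 uniform bytes]
    var = 4 * (256 ** 2 - 1) / 12.0       # Var[sum of 4 uniform bytes]
    return (byte_sum.astype(np.float64) - mean) / np.sqrt(var)
```

Because each value is recomputed from its state on the fly, the decoder needs no large codebook resident in cache, which is what makes computed codes attractive for memory-bound decoding.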
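The trellis side of the table (the tail-biting trellis of Algorithm 4 and the L = 16 setting in the setup row) can also be sketched: each decoded position reads an overlapping L-bit window of the compressed stream, advanced k bits per step, and tail-biting wraps the stream so the final windows reuse the leading bits. The window-to-value map below is our own stand-in for illustration (a real decoder would apply a computed code such as 1MAD to each state); the function name and scaling are assumptions:

```python
def decode_tail_biting(bitstream: str, L: int = 16, k: int = 2) -> list[float]:
    """Decode T = len(bitstream) / k values from a bit-shift trellis.

    Position t's state is the L-bit window starting at bit t*k; the
    stream wraps around ("tail-biting") so every position gets a full
    window and no extra bits need to be stored.
    """
    assert len(bitstream) % k == 0 and len(bitstream) >= L
    T = len(bitstream) // k
    extended = bitstream + bitstream[: L - k]      # wraparound bits
    states = [int(extended[t * k : t * k + L], 2) for t in range(T)]
    # Stand-in reconstruction: map each L-bit state into [-1, 1).
    return [s / 2 ** (L - 1) - 1.0 for s in states]
```

Consecutive windows share L − k bits, which constrains which values can follow which — the trellis structure that Viterbi-style quantization exploits — while storing only k bits per value.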