QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
Authors: Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement QuaRot using Hugging Face [Wolf et al., 2019] on top of the PyTorch framework [Paszke et al., 2019]. To quantize the inputs, we use per-token symmetric quantization (a single scale for every row) with a constant clipping ratio of 0.9 in all our experiments. We quantize the KV caches using asymmetric quantization with a group size of 128 and a constant clipping ratio of 0.95. For weight quantization, we use round-to-nearest (RTN) and GPTQ [Frantar et al., 2022] with per-column (also known as per-channel) symmetric quantization, where we extract the clipping ratio using a linear search over the squared error. We use 128 samples from the WikiText-2 [Merity et al., 2016] training set with 2048 sequence length as the calibration set during GPTQ quantization. |
| Researcher Affiliation | Collaboration | Saleh Ashkboos (ETH Zurich, saleh.ashkboos@inf.ethz.ch); Amirkeivan Mohtashami (EPFL, amirkeivan.mohtashami@epfl.ch); Maximilian L. Croci (Microsoft Research, mcroci@microsoft.com); Bo Li (ETH Zurich, bolibo@ethz.ch); Pashmina Cameron (Microsoft, pcameron@microsoft.com); Martin Jaggi (EPFL, martin.jaggi@epfl.ch); Dan Alistarh (IST Austria & Neural Magic, dan.alistarh@ist.ac.at); Torsten Hoefler (ETH Zurich, torsten.hoefler@inf.ethz.ch); James Hensman (Microsoft Research, jameshensman@microsoft.com) |
| Pseudocode | No | The paper contains flow diagrams (Figures 2, 3, 5, 6) but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at github.com/spcl/QuaRot. |
| Open Datasets | Yes | We use 128 samples from the WikiText-2 [Merity et al., 2016] training set with 2048 sequence length as the calibration set during GPTQ quantization. |
| Dataset Splits | No | The paper mentions using the "WikiText-2 [Merity et al., 2016] training set" for calibration, but does not explicitly provide specific training/validation/test dataset splits (percentages, sample counts, or explicit references to standard splits with citations) for reproduction. |
| Hardware Specification | Yes | On a single NVIDIA A100 GPU, modifying LLAMA2-70B with QuaRot takes 5 minutes and quantizing the model with GPTQ takes a further 2 hours. As we target consumer-type GPUs, we evaluate all the performance experiments on NVIDIA RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions software like Hugging Face, PyTorch, CUTLASS, and FlashInfer, and specifies "CUDA/12.1". However, it does not provide specific version numbers for the other key software libraries used (e.g., PyTorch version, Hugging Face Transformers version). |
| Experiment Setup | Yes | To quantize the inputs, we use per-token symmetric quantization (a single scale for every row) with a constant clipping ratio of 0.9 in all our experiments. We quantize the KV caches using asymmetric quantization with a group size of 128 and a constant clipping ratio of 0.95. For weight quantization, we use round-to-nearest (RTN) and GPTQ [Frantar et al., 2022] with per-column (also known as per-channel) symmetric quantization, where we extract the clipping ratio using a linear search over the squared error. We use 128 samples from the WikiText-2 [Merity et al., 2016] training set with 2048 sequence length as the calibration set during GPTQ quantization. |
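
The quoted experiment setup fixes the main quantization hyperparameters: per-token symmetric activation quantization with a 0.9 clipping ratio, and asymmetric group-wise KV-cache quantization with group size 128 and a 0.95 clipping ratio. The sketch below is a minimal PyTorch illustration of those two schemes, assuming 4-bit codes and simulated ("fake") quantization rather than the paper's CUTLASS kernels; the function names, rounding choices, and tensor layout are assumptions for illustration, not the QuaRot implementation.

```python
import torch

def quantize_per_token_symmetric(x: torch.Tensor, bits: int = 4, clip_ratio: float = 0.9):
    """Symmetric per-token quantization: one scale per row, with a fixed clipping ratio."""
    qmax = 2 ** (bits - 1) - 1                                   # 7 for signed 4-bit
    scale = (x.abs().amax(dim=-1, keepdim=True) * clip_ratio / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)     # integer codes in [-8, 7]
    return q, scale                                              # dequantize with q * scale

def quantize_groupwise_asymmetric(x: torch.Tensor, bits: int = 4,
                                  group_size: int = 128, clip_ratio: float = 0.95):
    """Asymmetric group-wise quantization (e.g. for KV-cache tensors)."""
    qmax = 2 ** bits - 1                                         # 15 for unsigned 4-bit
    g = x.reshape(-1, group_size)                                # assumes numel divisible by group_size
    xmin = g.amin(dim=-1, keepdim=True) * clip_ratio             # shrink the range toward zero
    xmax = g.amax(dim=-1, keepdim=True) * clip_ratio
    scale = ((xmax - xmin) / qmax).clamp(min=1e-8)
    zero = torch.round(-xmin / scale)                            # per-group zero point
    q = torch.clamp(torch.round(g / scale) + zero, 0, qmax)
    return q.reshape(x.shape), scale, zero                       # dequantize with (q - zero) * scale
```

On this reading, the per-token scheme matches the quoted "single scale for every row", while the group-wise scheme stores a scale and zero point for every 128 consecutive elements, which is what makes the KV-cache quantization asymmetric.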