QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

Authors: Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement QuaRot using Hugging Face [Wolf et al., 2019] on top of the PyTorch framework [Paszke et al., 2019]. To quantize the inputs, we use per-token symmetric quantization (a single scale for every row) with a constant clipping ratio of 0.9 in all our experiments. We quantize the KV caches using asymmetric quantization with a group size of 128 and a constant clipping ratio of 0.95. For weight quantization, we use round-to-nearest (RTN) and GPTQ [Frantar et al., 2022] with per-column (also known as per-channel) symmetric quantization, where we extract the clipping ratio using a linear search over the squared error. We use 128 samples from the WikiText-2 [Merity et al., 2016] training set with 2048 sequence length as the calibration set during GPTQ quantization. (A sketch of the activation and KV-cache quantization scheme appears below the table.)
Researcher Affiliation | Collaboration | Saleh Ashkboos (ETH Zurich, saleh.ashkboos@inf.ethz.ch); Amirkeivan Mohtashami (EPFL, amirkeivan.mohtashami@epfl.ch); Maximilian L. Croci (Microsoft Research, mcroci@microsoft.com); Bo Li (ETH Zurich, bolibo@ethz.ch); Pashmina Cameron (Microsoft, pcameron@microsoft.com); Martin Jaggi (EPFL, martin.jaggi@epfl.ch); Dan Alistarh (IST Austria & Neural Magic, dan.alistarh@ist.ac.at); Torsten Hoefler (ETH Zurich, torsten.hoefler@inf.ethz.ch); James Hensman (Microsoft Research, jameshensman@microsoft.com)
Pseudocode | No | The paper contains flow diagrams (Figures 2, 3, 5, 6) but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at github.com/spcl/QuaRot.
Open Datasets | Yes | We use 128 samples from the WikiText-2 [Merity et al., 2016] training set with 2048 sequence length as the calibration set during GPTQ quantization. (A sketch of assembling this calibration set appears below the table.)
Dataset Splits | No | The paper mentions using the WikiText-2 [Merity et al., 2016] training set for calibration, but does not explicitly provide train/validation/test splits (percentages, sample counts, or references to standard splits with citations) needed for reproduction.
Hardware Specification | Yes | On a single NVIDIA A100 GPU, modifying LLAMA2-70B with QuaRot takes 5 minutes and quantizing the model with GPTQ takes a further 2 hours. As we target consumer-type GPUs, we evaluate all the performance experiments on NVIDIA RTX 3090 GPUs.
Software Dependencies | No | The paper mentions software such as Hugging Face, PyTorch, CUTLASS, and FlashInfer, and specifies "CUDA/12.1". However, it does not provide version numbers for the other key libraries used (e.g., the PyTorch or Hugging Face Transformers versions).
Experiment Setup | Yes | To quantize the inputs, we use per-token symmetric quantization (a single scale for every row) with a constant clipping ratio of 0.9 in all our experiments. We quantize the KV caches using asymmetric quantization with a group size of 128 and a constant clipping ratio of 0.95. For weight quantization, we use round-to-nearest (RTN) and GPTQ [Frantar et al., 2022] with per-column (also known as per-channel) symmetric quantization, where we extract the clipping ratio using a linear search over the squared error. We use 128 samples from the WikiText-2 [Merity et al., 2016] training set with 2048 sequence length as the calibration set during GPTQ quantization. (A sketch of the clipping-ratio search for weight quantization appears below the table.)
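
The Research Type and Experiment Setup rows describe the activation and KV-cache quantization settings. Below is a minimal PyTorch sketch of those two schemes at 4-bit precision; the function names, the exact way the clipping ratio shrinks the range, and the reshape into groups are illustrative assumptions and are not taken from the QuaRot repository.

```python
import torch

def quantize_per_token_symmetric(x: torch.Tensor, bits: int = 4, clip_ratio: float = 0.9):
    # Symmetric per-token (per-row) quantization: one scale per row of x,
    # with the row maximum shrunk by a fixed clipping ratio (0.9 in the paper).
    qmax = 2 ** (bits - 1) - 1                                      # 7 for signed 4-bit
    scale = (x.abs().amax(dim=-1, keepdim=True) * clip_ratio / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q, scale                                                 # dequantize as q * scale

def quantize_groupwise_asymmetric(x: torch.Tensor, bits: int = 4,
                                  group_size: int = 128, clip_ratio: float = 0.95):
    # Asymmetric quantization with one (scale, zero-point) per group of 128
    # values, as used for the KV cache. Assumes the last dimension is a
    # multiple of group_size; the clipping of the min/max is illustrative.
    qmax = 2 ** bits - 1                                            # unsigned range
    xg = x.reshape(-1, group_size)
    xmin = xg.amin(dim=-1, keepdim=True) * clip_ratio
    xmax = xg.amax(dim=-1, keepdim=True) * clip_ratio
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    zero = torch.round(-xmin / scale)
    q = torch.clamp(torch.round(xg / scale) + zero, 0, qmax)
    return q.reshape(x.shape), scale, zero                          # dequantize as (q - zero) * scale
```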
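
The Open Datasets row quotes the calibration setup: 128 sequences of 2048 tokens from the WikiText-2 training set. A rough sketch of assembling such a set with the Hugging Face datasets and transformers libraries follows; the model name and the contiguous (rather than randomly sampled) slicing are assumptions, not details from the paper.

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical model choice; QuaRot evaluates several LLAMA-family models.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# WikiText-2 training split, the paper's calibration source for GPTQ.
train = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
ids = tokenizer("\n\n".join(train["text"]), return_tensors="pt").input_ids[0]

# 128 calibration samples of 2048 tokens each; contiguous slices for simplicity.
n_samples, seq_len = 128, 2048
calib = torch.stack([ids[i * seq_len:(i + 1) * seq_len] for i in range(n_samples)])
print(calib.shape)  # torch.Size([128, 2048])
```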
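
The Experiment Setup row also states that the weight clipping ratio is found with a linear search over the squared error. A simple sketch of that search for per-column symmetric RTN quantization is below; the search grid and the use of a single ratio shared across all columns are assumptions made for brevity, not the paper's exact procedure.

```python
import torch

def find_weight_clip_ratio(w: torch.Tensor, bits: int = 4, steps: int = 50):
    # Linear search over candidate clipping ratios, keeping the one that
    # minimizes the squared error of per-column symmetric RTN quantization.
    qmax = 2 ** (bits - 1) - 1
    col_absmax = w.abs().amax(dim=0, keepdim=True)        # one max per column
    best_ratio, best_err = 1.0, float("inf")
    for ratio in torch.linspace(0.5, 1.0, steps):
        scale = (col_absmax * ratio / qmax).clamp(min=1e-8)
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
        err = ((q * scale - w) ** 2).sum().item()
        if err < best_err:
            best_ratio, best_err = float(ratio), err
    return best_ratio
```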