OPTQ: Accurate Quantization for Generative Pre-trained Transformers

Authors: Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000).
Researcher Affiliation | Collaboration | Elias Frantar (IST Austria), Saleh Ashkboos (ETH Zurich), Torsten Hoefler (ETH Zurich), Dan Alistarh (IST Austria & Neural Magic)
Pseudocode | Yes | Finally, we present the full pseudocode for OPTQ in Algorithm 1, including the optimizations discussed above.
Open Source Code | Yes | The implementation is available at https://github.com/IST-DASLab/gptq.
Open Datasets | Yes | Our entire OPTQ calibration data consists of 128 random 2048 token segments from the C4 dataset (Raffel et al., 2020), i.e., excerpts from randomly crawled websites, which represents generic text data.
Dataset Splits | Yes | For language generation experiments, we calculate the perplexity, in standard fashion like Radford et al. (2019), as follows: First, the entire validation set is concatenated using two linebreaks as separators and encoded using the default Hugging Face tokenizer of each model.
Hardware Specification | Yes | We quantized all models (including the 175 billion parameter variants) using a single NVIDIA A100 GPU with 80GB of memory. More accessible GPUs, such as the NVIDIA A6000, have much lower memory bandwidth, so this strategy is even more effective: executing the 3-bit OPT-175B model on 2x A6000 GPUs reduces latency from 589 milliseconds for FP16 inference (on 8 GPUs) to 130 milliseconds, a 4.5x latency reduction.
Software Dependencies | No | The paper mentions PyTorch and Hugging Face, but does not provide specific version numbers for these software dependencies (e.g., 'PyTorch 1.9' or 'Hugging Face Transformers 4.x').
Experiment Setup | Yes | Our entire OPTQ calibration data consists of 128 random 2048 token segments from the C4 dataset (Raffel et al., 2020), i.e., excerpts from randomly crawled websites, which represents generic text data. We emphasize that this means that OPTQ does not see any task-specific data, and our results thus remain actually zero-shot. We perform standard uniform per-row asymmetric quantization on the min-max grid, similar to Dettmers et al. (2022). Additional evaluation details can be found in Appendix A.2.1. To ensure that the entire compression procedure can be performed with significantly less GPU memory than what would be required to run the full precision model, some care must be taken. Specifically, we always load one Transformer block, consisting of 6 layers, at a time into GPU memory and then accumulate the layer-Hessians and perform quantization. Finally, the current block inputs are sent through the fully quantized block again to produce the new inputs for the quantization of the next block. Hence, the quantization process operates not on the layer inputs in the full precision model but on the actual layer inputs in the already partially quantized one.
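To make the quoted procedures above more concrete, the sketches below illustrate how they might be implemented. They are illustrative reconstructions under stated assumptions, not code from the paper or its repository. First, the Open Datasets row describes the calibration set: 128 random 2048-token segments of generic web text from C4. One rough way to sample such a set with Hugging Face datasets is sketched here; the exact C4 configuration, tokenizer, and sampling details used by the authors may differ.

```python
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def get_c4_calibration_data(tokenizer_name="facebook/opt-125m",  # illustrative tokenizer choice
                            n_samples=128, seqlen=2048, seed=0):
    """Sample n_samples random seqlen-token segments from C4 (generic web text)."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    # Streaming keeps memory bounded; the C4 config/split used by the authors may differ.
    data = load_dataset("allenai/c4", "en", split="train", streaming=True)
    random.seed(seed)
    samples, it = [], iter(data)
    while len(samples) < n_samples:
        doc = next(it)
        ids = tokenizer(doc["text"], return_tensors="pt").input_ids
        if ids.shape[1] <= seqlen:  # skip documents shorter than one segment
            continue
        start = random.randint(0, ids.shape[1] - seqlen - 1)
        samples.append(ids[:, start : start + seqlen])
    return torch.cat(samples)  # [n_samples, seqlen]
```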
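The Dataset Splits row quotes the perplexity protocol: concatenate the validation set with two linebreaks, encode it with the model's default Hugging Face tokenizer, and score fixed-length segments. A minimal sketch of that evaluation follows, assuming non-overlapping 2048-token segments; WikiText-2 and the small OPT checkpoint are placeholders, since the paper evaluates several datasets and much larger models.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # placeholder; the paper uses larger OPT/BLOOM models
seqlen = 2048

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

# Join the validation documents with two linebreaks, as described in the quote above.
valdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")
text = "\n\n".join(valdata["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

nlls = []
n_segments = ids.shape[1] // seqlen
with torch.no_grad():
    for i in range(n_segments):
        batch = ids[:, i * seqlen : (i + 1) * seqlen].cuda()
        # The HF causal-LM loss is the mean token-level cross-entropy for this segment.
        loss = model(batch, labels=batch).loss
        nlls.append(loss.float() * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (n_segments * seqlen))
print(f"perplexity: {ppl.item():.2f}")
```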
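The Experiment Setup row states that weights are quantized with standard uniform per-row asymmetric quantization on the min-max grid. A minimal sketch of such a quantizer is given below; the function and variable names are my own, not the repository's.

```python
import torch

def quantize_per_row_minmax(W: torch.Tensor, bits: int = 4):
    """Uniform asymmetric per-row quantization on the min-max grid.

    Each row of W gets its own scale and zero-point so that the row minimum
    maps to 0 and the row maximum maps to 2^bits - 1.
    """
    qmax = 2 ** bits - 1
    wmin = W.min(dim=1, keepdim=True).values
    wmax = W.max(dim=1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = torch.round(-wmin / scale)
    Q = torch.clamp(torch.round(W / scale) + zero, 0, qmax)
    W_dequant = (Q - zero) * scale  # what the quantized layer effectively computes
    return Q.to(torch.uint8), scale, zero, W_dequant

# Example: 3-bit quantization of a random weight matrix.
W = torch.randn(4096, 4096)
Q, scale, zero, W_hat = quantize_per_row_minmax(W, bits=3)
print((W - W_hat).abs().mean())  # rounding error introduced by the grid
```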
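The same row also describes the memory-saving schedule: load one Transformer block at a time, accumulate the layer Hessians from the calibration activations, quantize the block's layers, and then send the block inputs through the now-quantized block to produce the calibration inputs for the next block. A rough sketch of that control flow is below. Here get_linear_layers and quantize_layer_with_hessian are hypothetical placeholders for the actual OPTQ per-layer solver, and each block is assumed to take and return a plain activation tensor (attention masks and other arguments omitted).

```python
import torch

@torch.no_grad()
def quantize_model_blockwise(blocks, calib_inputs):
    """Sketch of the block-by-block schedule described in the paper.

    blocks:        list of Transformer blocks; each is assumed to map a
                   [batch, seqlen, hidden] tensor to a tensor of the same shape.
    calib_inputs:  list of [seqlen, hidden] activations entering the first block,
                   e.g. the embedded 128 C4 calibration segments.
    """
    xs = calib_inputs
    for block in blocks:
        block.cuda()  # only one block resides in GPU memory at a time

        # 1) Accumulate a Hessian H = 2 * sum_x x x^T of the inputs of every
        #    linear layer in the block, via forward hooks.
        hessians, hooks = {}, []

        def make_hook(name):
            def hook(module, inputs, output):
                x = inputs[0].reshape(-1, inputs[0].shape[-1]).float()  # [tokens, in_features]
                hessians[name] = hessians.get(name, 0) + 2 * x.t() @ x
            return hook

        for name, layer in get_linear_layers(block):  # hypothetical helper
            hooks.append(layer.register_forward_hook(make_hook(name)))
        for x in xs:
            block(x.cuda().unsqueeze(0))
        for h in hooks:
            h.remove()

        # 2) Quantize each linear layer given its accumulated Hessian
        #    (the OPTQ per-layer solver, represented by a hypothetical helper).
        for name, layer in get_linear_layers(block):
            layer.weight.data = quantize_layer_with_hessian(layer.weight.data, hessians[name])

        # 3) Send the current block inputs through the fully quantized block to obtain
        #    the inputs for the next block, so later blocks are calibrated on the
        #    activations they will actually see in the partially quantized model.
        xs = [block(x.cuda().unsqueeze(0)).squeeze(0).cpu() for x in xs]
        block.cpu()
    return blocks
```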