OPTQ: Accurate Quantization for Generative Pre-trained Transformers

Authors: Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000).
Researcher Affiliation | Collaboration | Elias Frantar (IST Austria), Saleh Ashkboos (ETH Zurich), Torsten Hoefler (ETH Zurich), Dan Alistarh (IST Austria & Neural Magic)
Pseudocode | Yes | Finally, we present the full pseudocode for OPTQ in Algorithm 1, including the optimizations discussed above.
Open Source Code | Yes | The implementation is available at https://github.com/IST-DASLab/gptq.
Open Datasets | Yes | Our entire OPTQ calibration data consists of 128 random 2048 token segments from the C4 dataset (Raffel et al., 2020), i.e., excerpts from randomly crawled websites, which represents generic text data.
Dataset Splits | Yes | For language generation experiments, we calculate the perplexity, in standard fashion like Radford et al. (2019), as follows: First, the entire validation set is concatenated using two linebreaks as separators and encoded using the default Hugging Face tokenizer of each model.
Hardware Specification | Yes | We quantized all models (including the 175 billion parameter variants) using a single NVIDIA A100 GPU with 80GB of memory. More accessible GPUs, such as the NVIDIA A6000, have much lower memory bandwidth, so this strategy is even more effective: executing the 3-bit OPT-175B model on 2x A6000 GPUs reduces latency from 589 milliseconds for FP16 inference (on 8 GPUs) to 130 milliseconds, a 4.5x latency reduction.
Software Dependencies | No | The paper mentions PyTorch and Hugging Face, but does not provide specific version numbers for these software dependencies (e.g., 'PyTorch 1.9' or 'Hugging Face Transformers 4.x').
Experiment Setup | Yes | Our entire OPTQ calibration data consists of 128 random 2048 token segments from the C4 dataset (Raffel et al., 2020), i.e., excerpts from randomly crawled websites, which represents generic text data. We emphasize that this means that OPTQ does not see any task-specific data, and our results thus remain actually zero-shot. We perform standard uniform per-row asymmetric quantization on the min-max grid, similar to Dettmers et al. (2022). Additional evaluation details can be found in Appendix A.2.1. To ensure that the entire compression procedure can be performed with significantly less GPU memory than what would be required to run the full precision model, some care must be taken. Specifically, we always load one Transformer block, consisting of 6 layers, at a time into GPU memory and then accumulate the layer-Hessians and perform quantization. Finally, the current block inputs are sent through the fully quantized block again to produce the new inputs for the quantization of the next block. Hence, the quantization process operates not on the layer inputs in the full precision model but on the actual layer inputs in the already partially quantized one.
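To make the quoted procedures above more concrete, the sketches below illustrate how they might be implemented. They are illustrative reconstructions under stated assumptions, not code from the paper or its repository. First, the Open Datasets row describes the calibration set: 128 random 2048-token segments of generic web text from C4. One rough way to sample such a set with Hugging Face datasets is sketched here; the exact C4 configuration, tokenizer, and sampling details used by the authors may differ.

```python
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def get_c4_calibration_data(tokenizer_name="facebook/opt-125m",  # illustrative tokenizer choice
                            n_samples=128, seqlen=2048, seed=0):
    """Sample n_samples random seqlen-token segments from C4 (generic web text)."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    # Streaming keeps memory bounded; the C4 config/split used by the authors may differ.
    data = load_dataset("allenai/c4", "en", split="train", streaming=True)
    random.seed(seed)
    samples, it = [], iter(data)
    while len(samples) < n_samples:
        doc = next(it)
        ids = tokenizer(doc["text"], return_tensors="pt").input_ids
        if ids.shape[1] <= seqlen:  # skip documents shorter than one segment
            continue
        start = random.randint(0, ids.shape[1] - seqlen - 1)
        samples.append(ids[:, start : start + seqlen])
    return torch.cat(samples)  # [n_samples, seqlen]
```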
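The Dataset Splits row quotes the perplexity protocol: concatenate the validation set with two linebreaks, encode it with the model's default Hugging Face tokenizer, and score fixed-length segments. A minimal sketch of that evaluation follows, assuming non-overlapping 2048-token segments; WikiText-2 and the small OPT checkpoint are placeholders, since the paper evaluates several datasets and much larger models.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # placeholder; the paper uses larger OPT/BLOOM models
seqlen = 2048

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

# Join the validation documents with two linebreaks, as described in the quote above.
valdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")
text = "\n\n".join(valdata["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

nlls = []
n_segments = ids.shape[1] // seqlen
with torch.no_grad():
    for i in range(n_segments):
        batch = ids[:, i * seqlen : (i + 1) * seqlen].cuda()
        # The HF causal-LM loss is the mean token-level cross-entropy for this segment.
        loss = model(batch, labels=batch).loss
        nlls.append(loss.float() * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (n_segments * seqlen))
print(f"perplexity: {ppl.item():.2f}")
```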
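The Experiment Setup row states that weights are quantized with standard uniform per-row asymmetric quantization on the min-max grid. A minimal sketch of such a quantizer is given below; the function and variable names are my own, not the repository's.

```python
import torch

def quantize_per_row_minmax(W: torch.Tensor, bits: int = 4):
    """Uniform asymmetric per-row quantization on the min-max grid.

    Each row of W gets its own scale and zero-point so that the row minimum
    maps to 0 and the row maximum maps to 2^bits - 1.
    """
    qmax = 2 ** bits - 1
    wmin = W.min(dim=1, keepdim=True).values
    wmax = W.max(dim=1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = torch.round(-wmin / scale)
    Q = torch.clamp(torch.round(W / scale) + zero, 0, qmax)
    W_dequant = (Q - zero) * scale  # what the quantized layer effectively computes
    return Q.to(torch.uint8), scale, zero, W_dequant

# Example: 3-bit quantization of a random weight matrix.
W = torch.randn(4096, 4096)
Q, scale, zero, W_hat = quantize_per_row_minmax(W, bits=3)
print((W - W_hat).abs().mean())  # rounding error introduced by the grid
```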
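The same row also describes the memory-saving schedule: load one Transformer block at a time, accumulate the layer Hessians from the calibration activations, quantize the block's layers, and then send the block inputs through the now-quantized block to produce the calibration inputs for the next block. A rough sketch of that control flow is below. Here get_linear_layers and quantize_layer_with_hessian are hypothetical placeholders for the actual OPTQ per-layer solver, and each block is assumed to take and return a plain activation tensor (attention masks and other arguments omitted).

```python
import torch

@torch.no_grad()
def quantize_model_blockwise(blocks, calib_inputs):
    """Sketch of the block-by-block schedule described in the paper.

    blocks:        list of Transformer blocks; each is assumed to map a
                   [batch, seqlen, hidden] tensor to a tensor of the same shape.
    calib_inputs:  list of [seqlen, hidden] activations entering the first block,
                   e.g. the embedded 128 C4 calibration segments.
    """
    xs = calib_inputs
    for block in blocks:
        block.cuda()  # only one block resides in GPU memory at a time

        # 1) Accumulate a Hessian H = 2 * sum_x x x^T of the inputs of every
        #    linear layer in the block, via forward hooks.
        hessians, hooks = {}, []

        def make_hook(name):
            def hook(module, inputs, output):
                x = inputs[0].reshape(-1, inputs[0].shape[-1]).float()  # [tokens, in_features]
                hessians[name] = hessians.get(name, 0) + 2 * x.t() @ x
            return hook

        for name, layer in get_linear_layers(block):  # hypothetical helper
            hooks.append(layer.register_forward_hook(make_hook(name)))
        for x in xs:
            block(x.cuda().unsqueeze(0))
        for h in hooks:
            h.remove()

        # 2) Quantize each linear layer given its accumulated Hessian
        #    (the OPTQ per-layer solver, represented by a hypothetical helper).
        for name, layer in get_linear_layers(block):
            layer.weight.data = quantize_layer_with_hessian(layer.weight.data, hessians[name])

        # 3) Send the current block inputs through the fully quantized block to obtain
        #    the inputs for the next block, so later blocks are calibrated on the
        #    activations they will actually see in the partially quantized model.
        xs = [block(x.cuda().unsqueeze(0)).squeeze(0).cpu() for x in xs]
        block.cpu()
    return blocks
```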