FrameQuant: Flexible Low-Bit Quantization for Transformers

Authors: Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, Vikas Singh

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We performed an extensive set of experiments comparing FrameQuant with several quantization baselines for Vision models and Language models. The goal is to assess (a) the performance of the different methods on benchmark tasks and (b) how closely low-bit quantization can approach full-precision performance with a small degree of representation redundancy. We use the image classification task (Deng et al., 2009) for Vision models and perplexity for Language models.
Researcher Affiliation | Collaboration | 1) University of Wisconsin-Madison, 2) Google Research. Correspondence to: Harshavardhan Adepu <adepu@wisc.edu>
Pseudocode | Yes | Algorithm 1 (FrameQuant). Require: weight matrix Θ_l, previous-layer activations A_prev, input and output Fusion Frames P_l, P_prev, block size B. 1: Compute C_prev = P_prev^T A_prev and D_l = P_l^T Θ_l P_prev. 2: Compute σ = std(D_l), µ = mean(D_l). 3: D_l = clip(D_l, µ − 2σ, µ + 2σ). 4: D̂_l = quantize(D_l, C_prev, B) // modified GPTQ. 5: Store the quantized matrix D̂_l. Return P_l D̂_l C_prev // the quantized layer activations. (A Python sketch of this per-layer step is given after the table.)
Open Source Code | Yes | The code is available at https://github.com/vsingh-group/FrameQuant
Open Datasets | Yes | We evaluate our method on the ImageNet-1K classification task.
Dataset Splits | Yes | Finally, we evaluate the quantized models on the ImageNet-1K validation dataset and report the top-1 accuracy. (An evaluation sketch appears after the table.)
Hardware Specification | Yes | Table 7 shows the inference speeds of the quantized models on an NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions using the Huggingface hub, but does not list specific software dependencies (e.g., Python, PyTorch, or other libraries) with version numbers.
Experiment Setup | Yes | For quantizing the model weights of the pre-trained models obtained from the Huggingface hub (Wightman, 2019), we use 128 randomly selected images from the training dataset as the calibration dataset D. We quantize the parameter matrices of the layers sequentially from shallow layers to deep layers, similar to (Frantar et al., 2023). (A sketch of this setup appears after the table.)
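
Below is a minimal PyTorch sketch of the per-layer step in Algorithm 1 from the Pseudocode row. It assumes the Fusion Frame operators P_l and P_prev are given as dense Parseval-frame matrices (P P^T = I), and `quantize_blockwise` is only a placeholder for the paper's modified GPTQ routine; this illustrates the data flow, not the authors' implementation.

```python
import torch

def quantize_blockwise(D, C_prev, block_size):
    """Placeholder for the modified GPTQ step: naive symmetric rounding to a
    4-level (2-bit) grid per block of columns. The real routine uses C_prev to
    build a Hessian and compensates rounding error column by column."""
    D_hat = D.clone()
    for start in range(0, D.shape[1], block_size):
        block = D[:, start:start + block_size]
        scale = block.abs().max().clamp(min=1e-8) / 2.0
        q = torch.round(block / scale).clamp(-2, 1)      # 4 integer levels
        D_hat[:, start:start + block_size] = q * scale
    return D_hat

def framequant_layer(theta_l, a_prev, P_l, P_prev, block_size=128):
    """One FrameQuant layer step (Algorithm 1). Assumed shapes:
    theta_l: (d_out, d_in), a_prev: (d_in, n),
    P_l: (d_out, k_l), P_prev: (d_in, k_prev)."""
    # 1: project activations and weights into the frame domain
    C_prev = P_prev.T @ a_prev
    D_l = P_l.T @ theta_l @ P_prev
    # 2: statistics of the frame coefficients
    sigma, mu = D_l.std().item(), D_l.mean().item()
    # 3: clip outliers to mu +/- 2*sigma
    D_l = D_l.clamp(mu - 2 * sigma, mu + 2 * sigma)
    # 4: block-wise quantization (modified GPTQ in the paper)
    D_hat = quantize_blockwise(D_l, C_prev, block_size)
    # 5: D_hat is what gets stored; the layer output is reconstructed from it
    return P_l @ D_hat @ C_prev
```

With identity frames (P_l = I, P_prev = I) this reduces to ordinary weight quantization; the small redundancy of the frames (k slightly larger than d) is what the paper credits for the robustness of low-bit quantization.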
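
The Dataset Splits row reports top-1 accuracy on the ImageNet-1K validation set. A minimal sketch of such an evaluation loop is shown below; the dataset path, batch size, and loader settings are assumptions, not values from the paper.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
# Path and batch size are illustrative assumptions.
val_loader = DataLoader(
    datasets.ImageFolder("imagenet/val", transform=preprocess),
    batch_size=64, num_workers=4)

@torch.no_grad()
def top1_accuracy(model, loader):
    """Fraction of validation images whose highest-scoring class is correct."""
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# Usage: acc = top1_accuracy(quantized_model, val_loader)
```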
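
For the Experiment Setup row, the sketch below illustrates the described procedure: a pre-trained ViT from the timm/Huggingface hub, 128 randomly sampled training images as the calibration set, and layers visited from shallow to deep. The model name, dataset path, and the elided per-layer details are assumptions; `framequant_layer` refers to the sketch above.

```python
import random
import timm
import torch
from torchvision import datasets, transforms

# Pre-trained ViT from the timm / Huggingface hub (model name is illustrative).
model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

# 128 randomly selected training images as the calibration dataset D.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("imagenet/train", transform=preprocess)  # path is an assumption
calib_idx = random.sample(range(len(train_set)), 128)
calib = torch.stack([train_set[i][0] for i in calib_idx])

# Quantize parameter matrices sequentially from shallow to deep layers, so each
# layer is calibrated on activations produced by the already-quantized prefix.
with torch.no_grad():
    for block in model.blocks:          # transformer blocks in depth order
        for name, module in block.named_modules():
            if isinstance(module, torch.nn.Linear):
                # Here one would record this layer's inputs on `calib` (e.g. via a
                # forward pre-hook), build the frame operators, and update the
                # stored weights using the framequant_layer step sketched above.
                pass
```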