FrameQuant: Flexible Low-Bit Quantization for Transformers
Authors: Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, Vikas Singh
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We performed an extensive set of experiments comparing FrameQuant with several quantization baselines for vision models and language models. The goal is to assess (a) the performance of different methods on benchmark tasks and (b) how closely low-bit quantization can approach full-precision performance with a small degree of representation redundancy. We use the image classification task (Deng et al., 2009) for vision models and perplexity for language models. |
| Researcher Affiliation | Collaboration | ¹University of Wisconsin-Madison, ²Google Research. Correspondence to: Harshavardhan Adepu <adepu@wisc.edu> |
| Pseudocode | Yes | Algorithm 1 FrameQuant. Require: weight matrix Θ_l, previous-layer activations A_prev, input and output fusion frames P_l, P_prev, block size B. 1: Compute C_prev = P_prev^T A_prev and D_l = P_l^T Θ_l P_prev. 2: Compute σ = std(D_l), µ = mean(D_l). 3: D_l = clip(D_l, µ − 2σ, µ + 2σ) / (2σ). 4: D̂_l = quantize(D_l, C_prev, B) // modified GPTQ. 5: Store the quantized matrix D̂_l. Return P_l D̂_l C_prev // quantized layer activations. (A minimal code sketch of this procedure is given after the table.) |
| Open Source Code | Yes | The code is available at https://github.com/vsingh-group/FrameQuant |
| Open Datasets | Yes | We evaluate our method on the ImageNet-1K classification task. |
| Dataset Splits | Yes | Finally, we evaluate the quantized models on the ImageNet-1K validation dataset and report the top-1 accuracy. |
| Hardware Specification | Yes | Table 7 shows the inference speeds of the quantized models on an Nvidia A100 GPU. |
| Software Dependencies | No | The paper mentions using Huggingface hub, but does not list specific software dependencies (e.g., Python, PyTorch, or other libraries) with version numbers. |
| Experiment Setup | Yes | For quantizing the model weights of the pre-trained models obtained from the Huggingface hub (Wightman, 2019), we use 128 randomly selected images from the training dataset as the calibration dataset D. We quantize the parameter matrices of the layers sequentially from shallow layers to deep layers, similar to (Frantar et al., 2023). (A sketch of this calibration setup follows the table.) |
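Below is a minimal NumPy sketch of the per-layer procedure quoted in the Pseudocode row (Algorithm 1). It assumes the fusion-frame matrices `p_l` and `p_prev` are already available, and it replaces the paper's modified GPTQ step with a simple round-to-nearest placeholder (`quantize_rtn`); that placeholder is a simplification for illustration, not the authors' quantizer.

```python
import numpy as np

def quantize_rtn(D, bits=2):
    """Hypothetical stand-in for the modified GPTQ step in Algorithm 1:
    symmetric round-to-nearest quantization to 2**bits levels."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(D).max() / max(qmax, 1) + 1e-12
    return np.clip(np.round(D / scale), -qmax - 1, qmax) * scale

def framequant_layer(theta_l, a_prev, p_l, p_prev, bits=2):
    """Sketch of one FrameQuant layer pass.

    theta_l : (d_out, d_in)  weight matrix
    a_prev  : (d_in, n)      previous-layer activations
    p_l     : (d_out, r_out) fusion-frame matrix for this layer
    p_prev  : (d_in, r_in)   fusion-frame matrix for the previous layer
    """
    # Step 1: move activations and weights into the frame domain.
    c_prev = p_prev.T @ a_prev          # C_prev = P_prev^T A_prev
    d_l = p_l.T @ theta_l @ p_prev      # D_l = P_l^T Theta_l P_prev

    # Steps 2-3: clip to mu +/- 2*sigma and rescale by 2*sigma.
    sigma, mu = d_l.std(), d_l.mean()
    d_l = np.clip(d_l, mu - 2.0 * sigma, mu + 2.0 * sigma) / (2.0 * sigma)

    # Step 4: quantize in the frame domain (modified GPTQ in the paper;
    # round-to-nearest here as a simplification).
    d_hat = quantize_rtn(d_l, bits=bits)

    # Step 5 / return: store d_hat and synthesize quantized activations.
    return p_l @ d_hat @ c_prev
```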
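The calibration setup in the Experiment Setup row can be sketched as follows, assuming a torchvision `ImageFolder` layout of ImageNet-1K; the dataset path, batch size, and preprocessing transforms are placeholders not specified in the excerpt.

```python
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

# Placeholder preprocessing; the excerpt does not specify exact transforms.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Calibration set D: 128 randomly selected ImageNet-1K training images.
train_set = datasets.ImageFolder("/path/to/imagenet/train", transform=preprocess)
calib_idx = torch.randperm(len(train_set))[:128]
calib_loader = DataLoader(Subset(train_set, calib_idx), batch_size=32)

# Parameter matrices are then quantized sequentially from shallow to deep
# layers, each layer seeing the (already quantized) outputs of the layers
# before it, in the spirit of GPTQ (Frantar et al., 2023).
```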