OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
Authors: Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate OmniQuant's superior performance across diverse quantization configurations such as W4A4 (4-bit weight, 4-bit activation), W6A6, W4A16, W3A16, and W2A16. Additionally, OmniQuant demonstrates effectiveness in instruction-tuned models and delivers notable improvements in inference speed and memory reduction on real devices. |
| Researcher Affiliation | Collaboration | 1 OpenGVLab, Shanghai AI Laboratory; 2 The University of Hong Kong; 3 The Chinese University of Hong Kong |
| Pseudocode | Yes | Algorithm 1: Overall algorithm of OmniQuant. |
| Open Source Code | Yes | Codes are available at https://github.com/OpenGVLab/OmniQuant. |
| Open Datasets | Yes | We employ a calibration dataset consisting of 128 randomly selected 2048-token segments from WikiText2 (Merity et al., 2016). Evaluation: following previous work (Lin et al., 2023; Frantar et al., 2022), we evaluate quantized models by reporting the perplexity of language generation experiments, specifically on WikiText2 (Merity et al., 2016), PTB (Marcus et al., 1994), and C4 (Raffel et al., 2020). (Calibration-sampling and perplexity-evaluation sketches based on this description follow the table.) |
| Dataset Splits | No | No explicit validation dataset split is mentioned. The paper uses a 'calibration dataset consisting of 128 randomly selected 2048-token segments from Wiki Text2' for optimizing quantization parameters, and then evaluates on various test datasets. |
| Hardware Specification | Yes | For instance, the LLaMA-2 model family (sizes 7B-70B) can be processed with OmniQuant on a single A100-40G GPU within 1-16 hours using 128 samples. The entire training process is facilitated on a single NVIDIA A100 GPU, using a batch size of 1 over 20 epochs. Table 3 shows memory requirements and inference speeds of the LLaMA family on an NVIDIA A100-80G. |
| Software Dependencies | No | No specific software dependencies with version numbers are listed in the paper. |
| Experiment Setup | Yes | To optimize the learnable parameters, we utilize the AdamW optimizer with zero weight decay. The learning rates for learnable weight clipping and equivalent transformation are set to 5e-3 and 1e-2, respectively. We employ a calibration dataset consisting of 128 randomly selected 2048-token segments from WikiText2 (Merity et al., 2016). The entire training process is facilitated on a single NVIDIA A100 GPU, using a batch size of 1 over 20 epochs, except for W2A16 quantization, which uses 40 epochs. (An optimizer-setup sketch based on these settings follows the table.) |
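
The calibration protocol quoted under "Open Datasets" (128 randomly selected 2048-token segments from WikiText2) maps onto a simple sampling routine. The sketch below shows one way such segments could be drawn; the dataset identifier, tokenizer, and function name are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch: draw 128 random 2048-token calibration segments from the
# WikiText2 training split, as described in the paper's experiment setup.
# Model/dataset names are assumptions for illustration only.
import random

from datasets import load_dataset
from transformers import AutoTokenizer


def get_calibration_segments(model_name="meta-llama/Llama-2-7b-hf",
                             n_samples=128, seq_len=2048, seed=0):
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    # Concatenate the WikiText2 training split into one long token stream.
    train = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    ids = tokenizer("\n\n".join(train["text"]), return_tensors="pt").input_ids[0]

    # Randomly slice n_samples windows of seq_len tokens (windows may overlap).
    rng = random.Random(seed)
    segments = []
    for _ in range(n_samples):
        start = rng.randint(0, ids.numel() - seq_len - 1)
        segments.append(ids[start:start + seq_len].unsqueeze(0))
    return segments  # list of [1, seq_len] LongTensors used as calibration inputs
```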
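
The evaluation in the same row reports perplexity on WikiText2, PTB, and C4. Below is a minimal sketch of the standard non-overlapping-window perplexity recipe used by the prior work cited there, assuming a HuggingFace causal language model and a pre-tokenized test stream `ids`; the authors' exact script may differ.

```python
# Hedged sketch: perplexity over consecutive 2048-token windows, computed as
# exp(mean token-level cross-entropy). Assumes a HuggingFace causal LM.
import torch


@torch.no_grad()
def perplexity(model, ids, seq_len=2048, device="cuda"):
    model.eval()
    nlls = []
    n_windows = ids.numel() // seq_len
    for i in range(n_windows):
        batch = ids[i * seq_len:(i + 1) * seq_len].unsqueeze(0).to(device)
        # HuggingFace causal LMs return the mean cross-entropy when labels are given.
        loss = model(batch, labels=batch).loss
        nlls.append(loss.float() * seq_len)
    return torch.exp(torch.stack(nlls).sum() / (n_windows * seq_len))
```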
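
The optimization settings quoted under "Experiment Setup" (AdamW, zero weight decay, learning rates 5e-3 for learnable weight clipping and 1e-2 for the equivalent transformation) correspond to a two-group optimizer. A minimal sketch, where `lwc_params` and `let_params` are hypothetical handles to the respective learnable tensors of a quantized block:

```python
# Hedged sketch: AdamW with zero weight decay and separate learning rates for
# the learnable weight clipping (LWC) and learnable equivalent transformation
# (LET) parameters, matching the hyperparameters quoted above.
import torch


def build_optimizer(lwc_params, let_params):
    return torch.optim.AdamW(
        [
            {"params": lwc_params, "lr": 5e-3},  # learnable weight clipping
            {"params": let_params, "lr": 1e-2},  # equivalent transformation
        ],
        weight_decay=0.0,
    )
```

Per the quoted setup, each block would then be optimized with batch size 1 for 20 epochs over the 128 calibration segments (40 epochs for W2A16).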