OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

Authors: Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments validate OmniQuant's superior performance across diverse quantization configurations such as W4A4 (4-bit weight, 4-bit activation), W6A6, W4A16, W3A16, and W2A16. Additionally, OmniQuant demonstrates effectiveness in instruction-tuned models and delivers notable improvements in inference speed and memory reduction on real devices.
Researcher Affiliation | Collaboration | 1OpenGVLab, Shanghai AI Laboratory; 2The University of Hong Kong; 3The Chinese University of Hong Kong
Pseudocode | Yes | Algorithm 1: Overall algorithm of OmniQuant.
Open Source Code | Yes | Codes are available at https://github.com/OpenGVLab/OmniQuant.
Open Datasets | Yes | We employ a calibration dataset consisting of 128 randomly selected 2048-token segments from WikiText2 (Merity et al., 2016). Evaluation: following previous work (Lin et al., 2023; Frantar et al., 2022), we evaluate quantized models by reporting the perplexity of language generation experiments, specifically on WikiText2 (Merity et al., 2016), PTB (Marcus et al., 1994), and C4 (Raffel et al., 2020).
Dataset Splits | No | No explicit validation dataset split is mentioned. The paper uses a "calibration dataset consisting of 128 randomly selected 2048-token segments from WikiText2" for optimizing quantization parameters, and then evaluates on various test datasets.
Hardware Specification | Yes | For instance, the LLaMA-2 model family (7B-70B) can be processed with OmniQuant on a single A100-40G GPU within 1-16 hours using 128 samples. The entire training process is facilitated on a single Nvidia A100 GPU, using a batch size of 1 over 20 epochs. Table 3 shows memory requirements and inference speeds of the LLaMA family on an NVIDIA A100-80G.
Software Dependencies | No | No specific software dependencies with version numbers are listed in the paper.
Experiment Setup | Yes | To optimize the learnable parameters, we utilize the AdamW optimizer with zero weight decay. The learning rate for learnable weight clipping and equivalent transformation is set as 5e-3 and 1e-2, respectively. We employ a calibration dataset consisting of 128 randomly selected 2048-token segments from WikiText2 (Merity et al., 2016). The entire training process is facilitated on a single Nvidia A100 GPU, using a batch size of 1 over 20 epochs, except for W2A16 quantization, which leverages 40 epochs.
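As a rough illustration of the reported setup (128 random 2048-token WikiText2 segments for calibration; AdamW with zero weight decay, lr 5e-3 for learnable weight clipping and 1e-2 for the equivalent transformation), the following is a minimal sketch assuming PyTorch and the Hugging Face datasets/transformers libraries. The function names and the split of parameters into clipping and transformation groups are illustrative placeholders, not the authors' released code.

```python
import random

import torch
from datasets import load_dataset
from transformers import AutoTokenizer


def build_calibration_set(tokenizer, n_samples=128, seq_len=2048, seed=0):
    """Sample n_samples random seq_len-token segments from the WikiText2 train split."""
    raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    tokens = tokenizer("\n\n".join(raw["text"]), return_tensors="pt").input_ids
    rng = random.Random(seed)
    segments = []
    for _ in range(n_samples):
        start = rng.randint(0, tokens.shape[1] - seq_len - 1)
        segments.append(tokens[:, start:start + seq_len])
    return segments


def make_optimizer(clipping_params, transform_params):
    """AdamW with zero weight decay; lr 5e-3 for learnable weight clipping and
    1e-2 for equivalent transformation, per the reported hyperparameters."""
    return torch.optim.AdamW(
        [
            {"params": clipping_params, "lr": 5e-3},
            {"params": transform_params, "lr": 1e-2},
        ],
        weight_decay=0.0,
    )


# Hypothetical usage (batch size 1 over 20 epochs per the reported setup; 40 for W2A16):
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# calib = build_calibration_set(tokenizer)
```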