QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization

Authors: Xiuying Wei, Ruihao Gong, Yuhang Li, Xianglong Liu, Fengwei Yu

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on various tasks including computer vision (image classification, object detection) and natural language processing (text classification and question answering) prove its superiority.
Researcher Affiliation | Collaboration | Xiuying Wei 1,2, Ruihao Gong 1,2, Yuhang Li 2, Xianglong Liu 1, Fengwei Yu 2; 1 State Key Lab of Software Development Environment, Beihang University; 2 SenseTime Research
Pseudocode | Yes | Algorithm 1: QDROP in one block for one batch; Algorithm 2: Implementations of three cases in Sec. 3.1 (an illustrative sketch of the core dropping step follows this table)
Open Source Code | Yes | Our code is available at https://github.com/wimh966/QDrop and has been integrated into MQBench (https://github.com/ModelTC/MQBench).
Open Datasets | Yes | To investigate the influence of activation quantization when reconstructing the layer/block output, we conduct preliminary experiments on the ImageNet (Russakovsky et al., 2015) dataset.
Dataset Splits | Yes | For the ImageNet dataset, we sample 1024 images as the calibration set, while for COCO we use 256 images. In NLP, we sample 1024 examples.
Hardware Specification | No | The paper mentions 'GPU effort' in general but does not specify any particular GPU models, CPU models, or other hardware used for the experiments.
Software Dependencies | No | The paper states 'Our code is based on PyTorch (Paszke et al., 2019)' but does not provide version numbers for PyTorch or other software dependencies.
Experiment Setup | Yes | Each block or layer output is reconstructed for 20k iterations. For the ImageNet dataset, we sample 1024 images as the calibration set, while for COCO we use 256 images. In NLP, we sample 1024 examples. We set the default dropping probability p to 0.5 unless explicitly mentioned otherwise. The weight tuning method is the same as in Nagel et al. (2020); Li et al. (2021a). ... Hyper-parameters are kept as in BRECQ: batch size 32, learning rate 4e-5 for the activation step size, learning rate 1e-3 for weight tuning, 20,000 iterations. ... The resolution is set to 800 (max size 1333) for ResNets and 600 (max size 1000) for MobileNetV2, respectively, and batch size is set to 2, while other settings match the classification task. ... We keep the maximum sequence length at 128 for the GLUE benchmark but use maximum sequence length 384 with doc stride 128 for SQuAD1.1. (These settings are collected in the configuration sketch after this table.)
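The paper's Algorithm 1 is not reproduced on this page; a minimal PyTorch sketch of the core idea, randomly dropping activation quantization with probability p during block-wise reconstruction, might look as follows. Function and variable names here are illustrative assumptions, not taken from the official repository.

```python
import torch

def qdrop_mix(x_fp: torch.Tensor, x_q: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Illustrative sketch of random quantization dropping.

    During block-wise reconstruction, each activation element keeps its
    full-precision value with probability p and its fake-quantized value
    otherwise; at inference time quantization is applied everywhere.
    """
    drop_mask = torch.rand_like(x_fp) < p   # True -> keep full precision
    return torch.where(drop_mask, x_fp, x_q)


# Example: mix activations for one calibration batch
x_fp = torch.randn(32, 64, 56, 56)      # full-precision activation
x_q = torch.round(x_fp * 16) / 16       # stand-in for a fake-quantized copy
x_mix = qdrop_mix(x_fp, x_q, p=0.5)     # fed into the block being reconstructed
```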
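For convenience, the hyper-parameters quoted in the Experiment Setup row can be gathered into a single configuration sketch. The key names below are assumptions for illustration; only the values restate what the paper reports.

```python
# Assumed key names; values restate the settings quoted above.
QDROP_CALIB_CONFIG = {
    "drop_probability": 0.5,         # default p unless stated otherwise
    "reconstruction_iters": 20_000,  # per block/layer
    "batch_size": 32,                # classification calibration batches
    "lr_act_step_size": 4e-5,        # learning rate for activation step size
    "lr_weight_tuning": 1e-3,        # learning rate for weight tuning
    "calibration_samples": {
        "imagenet": 1024,            # images
        "coco": 256,                 # images (detection, batch size 2)
        "nlp": 1024,                 # examples (GLUE / SQuAD1.1)
    },
    "max_seq_length": {"glue": 128, "squad1.1": 384},
    "doc_stride_squad": 128,
}
```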