QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization
Authors: Xiuying Wei, Ruihao Gong, Yuhang Li, Xianglong Liu, Fengwei Yu
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on various tasks including computer vision (image classification, object detection) and natural language processing (text classification and question answering) prove its superiority. |
| Researcher Affiliation | Collaboration | Xiuying Wei (1,2), Ruihao Gong (1,2), Yuhang Li (2), Xianglong Liu (1), Fengwei Yu (2); (1) State Key Lab of Software Development Environment, Beihang University; (2) SenseTime Research |
| Pseudocode | Yes | Algorithm 1: QDROP in one block for one batch; Algorithm 2: Implementations of three cases in Sec. 3.1 (an illustrative sketch of the dropping step appears after this table) |
| Open Source Code | Yes | Our code is available at https://github.com/wimh966/QDrop and has been integrated into MQBench (https://github.com/ModelTC/MQBench). |
| Open Datasets | Yes | To investigate the influence of activation quantization when reconstructing the layer/block output, we conduct preliminary experiments on the ImageNet (Russakovsky et al., 2015) dataset. |
| Dataset Splits | Yes | For the ImageNet dataset, we sample 1024 images as the calibration set, while for COCO we use 256 images. In NLP, we sample 1024 examples. |
| Hardware Specification | No | The paper mentions 'GPU effort' generally but does not specify any particular GPU models, CPU models, or other hardware components used for the experiments. |
| Software Dependencies | No | The paper states 'Our code is based on PyTorch (Paszke et al., 2019)' but does not provide specific version numbers for PyTorch or other software dependencies. |
| Experiment Setup | Yes | Each block or layer output is reconstructed for 20k iterations. For the ImageNet dataset, we sample 1024 images as the calibration set, while for COCO we use 256 images. In NLP, we sample 1024 examples. We set the default dropping probability p to 0.5, except where explicitly mentioned. The weight tuning method is the same as Nagel et al. (2020); Li et al. (2021a). ... Hyper-parameters are kept the same as BRECQ, such as batch size 32, learning rate for activation step size 4e-5, learning rate for weight tuning 1e-3, and 20000 iterations. ... The resolution is set to 800 (max size 1333) and 600 (max size 1000) for ResNets and MobileNetV2, respectively, and the batch size is set to 2, while other settings match the classification task. ... We keep the maximum sequence length at 128 for the GLUE benchmark but use maximum sequence length 384 with doc stride 128 for SQuAD1.1. (These values are gathered in the second sketch after this table.) |
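The "Pseudocode" row points to Algorithm 1 (QDROP in one block for one batch). As a rough illustration of the core idea, randomly replacing quantized activations with their full-precision counterparts during calibration, here is a minimal PyTorch sketch. Function and parameter names (`fake_quant`, `qdrop`, `scale`, `zero_point`) are our illustrative assumptions, not the authors' released implementation, and the straight-through estimator needed for gradient-based reconstruction is omitted for brevity.

```python
import torch

def fake_quant(x, scale, zero_point, qmin=-128, qmax=127):
    # Uniform fake quantization: map to the integer grid, clamp, dequantize.
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

def qdrop(x, scale, zero_point, p=0.5):
    # With probability p, "drop" quantization for an element, i.e. pass its
    # full-precision value through; otherwise use the fake-quantized value.
    # The paper's default dropping probability is p = 0.5.
    x_q = fake_quant(x, scale, zero_point)
    drop = torch.rand_like(x) < p  # independent Bernoulli(p) mask per element
    return torch.where(drop, x, x_q)
```

In the paper this randomness is applied only during block-wise reconstruction on the calibration set; at inference time all activations are quantized as usual.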
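The "Experiment Setup" row gathers the reported hyper-parameters across several sentences. For readability, here they are collected into a single Python dictionary; the key names are ours, the values are the paper's:

```python
# Hyper-parameters as reported in the paper; key names are illustrative.
QDROP_SETUP = {
    "drop_prob": 0.5,           # default quantization-dropping probability
    "recon_iters": 20_000,      # per-block/layer reconstruction iterations
    "batch_size": 32,           # classification (detection uses batch size 2)
    "lr_act_step_size": 4e-5,   # learning rate for activation step size
    "lr_weight_tuning": 1e-3,   # learning rate for weight tuning
    "calib_size": {"imagenet": 1024, "coco": 256, "nlp": 1024},
    "max_seq_len": {"glue": 128, "squad1.1": 384},  # SQuAD1.1 doc stride: 128
}
```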