Outlier-aware Slicing for Post-Training Quantization in Vision Transformer

Authors: Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, Rongrong Ji

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate the impact of reconstruction granularity on quantization performance across various models using the ImageNet dataset. Notably, with 4/4-bit quantization on DeiT-tiny, we attain a Top-1 accuracy of 66.31%. Furthermore, our approach achieves a Top-1 accuracy of 80.50% on ViT-small, surpassing NoisyQuant by a margin of 3.64% (80.50% versus 76.86%).
Researcher Affiliation | Collaboration | (1) This work was done when Yuexiao Ma was an intern at ByteDance Inc. (2) Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, 361005, P.R. China. (3) ByteDance Inc. (4) Peng Cheng Laboratory, Shenzhen, China. (5) Institute of Artificial Intelligence, Xiamen University.
Pseudocode | Yes | Algorithm 1: Granularity and Optimization
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology, nor a link to a code repository.
Open Datasets | Yes | We empirically validate the impact of reconstruction granularity on quantization performance across various models using the ImageNet dataset.
Dataset Splits | No | The paper mentions using "16 batch data for PTQ optimization" and "16 batches of 64 samples each from the training set for calibration" but does not specify a train/validation/test split with percentages or sample counts. (A calibration-loader sketch follows the table.)
Hardware Specification | No | The paper states, "We conduct our experiments on NVIDIA Tesla", but does not specify the exact GPU model (e.g., V100, A100).
Software Dependencies | No | The paper refers to the settings of methods such as AdaRound, BRECQ, and QDrop, but does not list specific software dependencies with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x).
Experiment Setup | Yes | For the hyper-parameter settings of the optimization parameters, such as reconstruction iteration, learning rate, etc., we refer to the default settings of the above methods and keep them consistent. Please refer to Appendix F for details. ... We use 16 batches of 64 samples each from the training set for calibration. The learning rates are set at 1e-3 for the rounding parameter and 4e-5 for the quantization scale of the activation layer. The rounding loss rate is set at 0.1, with 20,000 iterations per optimization block. The activation value drop probability is 50%. We gradually reduce the power β of the progressive soft function from 20 to 2. (A soft-rounding sketch follows the table.)
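
The calibration protocol quoted in the "Dataset Splits" and "Experiment Setup" rows (16 batches of 64 samples drawn from the ImageNet training set) can be set up as in the following minimal sketch. The dataset path, the torchvision loader, and the preprocessing pipeline are assumptions for illustration; the paper does not specify them.

```python
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

# Standard ImageNet evaluation preprocessing (an assumption; the paper
# does not state its calibration transforms).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "/path/to/imagenet" is a placeholder for a local ImageNet copy.
train_set = datasets.ImageNet("/path/to/imagenet", split="train",
                              transform=preprocess)

# 16 batches x 64 samples = 1024 calibration images, sampled uniformly
# at random from the training set.
num_batches, batch_size = 16, 64
indices = torch.randperm(len(train_set))[: num_batches * batch_size]
calib_loader = DataLoader(Subset(train_set, indices.tolist()),
                          batch_size=batch_size, shuffle=False)

# Cache the batches that the PTQ reconstruction will replay.
calib_batches = [images for images, _ in calib_loader]
```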
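The "Experiment Setup" row lists AdaRound-style optimization hyper-parameters: learning rates of 1e-3 for the rounding parameter and 4e-5 for the activation quantization scale, a rounding-loss weight of 0.1, 20,000 iterations per block, and a soft-rounding power β annealed from 20 to 2. Below is a minimal sketch of that schedule, assuming AdaRound's rectified-sigmoid soft rounding (one of the paper's stated baselines) and a linear β decay, neither of which the paper confirms.

```python
import torch

GAMMA, ZETA = -0.1, 1.1  # rectified-sigmoid stretch constants from AdaRound

def soft_round(v: torch.Tensor) -> torch.Tensor:
    """Continuous relaxation of the per-weight rounding decision, in [0, 1]."""
    return torch.clamp(torch.sigmoid(v) * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

def rounding_reg(v: torch.Tensor, beta: float) -> torch.Tensor:
    """AdaRound regularizer: pushes each soft-rounding value toward 0 or 1.
    A large beta tolerates intermediate values; a small beta forces a decision."""
    return (1.0 - (2.0 * soft_round(v) - 1.0).abs().pow(beta)).sum()

def beta_schedule(step: int, total: int = 20_000,
                  start: float = 20.0, end: float = 2.0) -> float:
    """Anneal beta from 20 down to 2 (linear decay is an assumption;
    the paper only states the two endpoints)."""
    return start + (end - start) * step / max(total - 1, 1)

# Placeholders for one block's learnable quantization parameters.
v = torch.zeros(1024, requires_grad=True)        # rounding logits
a_scale = torch.tensor(0.1, requires_grad=True)  # activation quant scale
opt = torch.optim.Adam([{"params": [v], "lr": 1e-3},
                        {"params": [a_scale], "lr": 4e-5}])

for step in range(20_000):
    # recon_loss stands in for the block-output reconstruction error,
    # computed with QDrop-style 50% activation-quantization drop.
    recon_loss = torch.zeros(())
    loss = recon_loss + 0.1 * rounding_reg(v, beta_schedule(step))
    opt.zero_grad()
    loss.backward()
    opt.step()
```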