Outlier-aware Slicing for Post-Training Quantization in Vision Transformer
Authors: Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, Rongrong Ji
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate the impact of reconstruction granularity on quantization performance across various models using the ImageNet dataset. Notably, with 4/4-bit quantization on DeiT-tiny, we attain a Top-1 accuracy of 66.31%. Furthermore, our approach achieves a Top-1 accuracy of 80.50% on ViT-small, surpassing NoisyQuant by a margin of 3.64% (80.50% versus 76.86%). |
| Researcher Affiliation | Collaboration | 1This work was done when Yuexiao Ma was an intern at ByteDance Inc. 2Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, 361005, P.R. China. 3ByteDance Inc. 4Peng Cheng Laboratory, Shenzhen, China. 5Institute of Artificial Intelligence, Xiamen University. |
| Pseudocode | Yes | Algorithm 1 Granularity and Optimization |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology or a link to a code repository. |
| Open Datasets | Yes | We empirically validate the impact of reconstruction granularity on quantization performance across various models using the ImageNet dataset. |
| Dataset Splits | No | The paper mentions using "16 batch data for PTQ optimization" and "16 batches of 64 samples each from the training set for calibration" but does not specify a train/validation/test split with percentages or sample counts. |
| Hardware Specification | No | The paper states, "We conduct our experiments on NVIDIA Tesla", but does not specify the exact model (e.g., V100, A100), which is required for a specific hardware detail. |
| Software Dependencies | No | The paper mentions following the settings of methods such as AdaRound, BRECQ, and QDrop, but it does not list specific software dependencies with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x). |
| Experiment Setup | Yes | For the hyper-parameter settings of the optimization parameters, such as reconstruction iteration, learning rate, etc., we refer to the default settings of the above methods and keep them consistent. Please refer to Appendix F for details. ... We use 16 batches of 64 samples each from the training set for calibration. The learning rates are set at 1e-3 for the rounding parameter and 4e-5 for the quantization scale of the activation layer. The rounding loss rate is set at 0.1, with 20,000 iterations per optimization block. The activation value drop probability is 50%. We gradually reduce the power β of the progressive soft function from 20 to 2. |
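The quoted hyper-parameters can be collected into a short configuration sketch. This is a minimal illustration, assuming an AdaRound-style soft-rounding regularizer (the formulation shared by the cited baselines AdaRound, BRECQ, and QDrop); all identifiers below (`CALIB`, `beta_schedule`, `rounding_reg`) are hypothetical, not names from the paper.

```python
# Calibration/optimization settings quoted in the row above.
CALIB = {
    "num_batches": 16,          # 16 calibration batches ...
    "batch_size": 64,           # ... of 64 training samples each
    "lr_rounding": 1e-3,        # learning rate for the rounding parameter
    "lr_act_scale": 4e-5,       # learning rate for the activation quantization scale
    "rounding_loss_rate": 0.1,  # weight on the rounding loss term
    "iters_per_block": 20_000,  # optimization iterations per block
    "act_drop_prob": 0.5,       # activation value drop probability (as in QDrop)
    "beta_start": 20.0,         # power beta of the progressive soft function ...
    "beta_end": 2.0,            # ... gradually reduced from 20 to 2
}

def beta_schedule(step: int, total_steps: int) -> float:
    """Linearly anneal beta from beta_start down to beta_end over one block."""
    frac = step / max(total_steps - 1, 1)
    return CALIB["beta_start"] + frac * (CALIB["beta_end"] - CALIB["beta_start"])

def rounding_reg(soft_rounds, beta: float) -> float:
    """AdaRound-style regularizer sum(1 - |2h - 1|^beta): pushes each soft
    rounding value h in [0, 1] toward a hard 0/1 decision as beta shrinks."""
    return sum(1.0 - abs(2.0 * h - 1.0) ** beta for h in soft_rounds)
```

The regularizer is zero once every soft rounding value has committed to 0 or 1, and is maximal at h = 0.5; annealing beta from 20 to 2 keeps the penalty loose early in optimization and progressively forces hard rounding decisions.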