Oscillation-free Quantization for Low-bit Vision Transformers

Authors: Shih-Yang Liu, Zechun Liu, Kwang-Ting Cheng

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that these proposed techniques successfully abate weight oscillation and consistently achieve substantial accuracy improvement on ImageNet.
Researcher Affiliation | Collaboration | 1 Hong Kong University of Science and Technology, 2 Reality Labs, Meta Inc.
Pseudocode | Yes | Algorithm 1 Confidence-Guided Annealing
Open Source Code | Yes | Code and models are available at: https://github.com/nbasyl/OFQ.
Open Datasets | Yes | In this section, we evaluate our proposed methods on the DeiT-tiny, DeiT-small (Touvron et al., 2021) and Swin-tiny (Liu et al., 2021a) architectures on the ILSVRC12 ImageNet classification dataset (Krizhevsky et al., 2017).
Dataset Splits | Yes | Table 1: Noise injection analysis on quantized DeiT-T. The 1st, 6th, and 11th row are the accuracies of the converged model before the noise injection. Random refers to random-position noise injection; within BR refers to injecting noise only to weights within the boundary range. % of weights refers to the fraction of the weights with noise injected. µ and σ denote the mean and variance over ten experiment trials.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor speeds, memory amounts, or other machine specifications) used for running its experiments.
Software Dependencies | No | The paper implies PyTorch through the linked GitHub repository, but it does not provide version numbers for any key software components or libraries.
Experiment Setup | Yes | Our quantized models are trained for 300 epochs with knowledge distillation using the corresponding full-precision models as the teacher models and as initialization. For quantized DeiT-T, 2-bit DeiT-S and 2-bit Swin-T, the training setting follows that of DeiT (Touvron et al., 2021) while without mixup/cutmix (Zhang et al., 2018b; Yun et al., 2019) data augmentation. For 3-bit/4-bit quantized DeiT-S and Swin-T, we follow the training recipe in (Li et al., 2022a;b). The number of annealing epochs is set to 25 for fine-tuning the optimized model with CGA. We apply 8-bit quantization for the first (patch embedding) layer and the last (classification and distillation) layers following (Esser et al., 2020; Li et al., 2022a;b).
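The quoted setup combines 8-bit quantization for the first and last layers with low-bit (2–4 bit) quantization elsewhere, in the LSQ style cited by the paper (Esser et al., 2020): scale by a learned step size, clamp to the integer range, round, and rescale. A minimal sketch of such a fake quantizer, written in plain Python as a stand-in for the tensor version; the function name and signature are illustrative, not the paper's actual code:

```python
def lsq_quantize(weights, step, n_bits):
    """LSQ-style fake quantization sketch (after Esser et al., 2020).

    Each weight is scaled by a (normally learnable) step size, clamped to
    the signed n-bit integer range, rounded to the nearest level, and
    rescaled. Names here are illustrative, not from the OFQ codebase.
    """
    qn = -(2 ** (n_bits - 1))      # lowest signed level, e.g. -2 for 2-bit
    qp = 2 ** (n_bits - 1) - 1     # highest signed level, e.g. +1 for 2-bit

    def quantize_one(w):
        level = round(w / step)                 # nearest quantization level
        level = max(qn, min(qp, level))         # clamp to representable range
        return level * step                     # rescale back to weight space

    return [quantize_one(w) for w in weights]
```

With `n_bits=2` and `step=0.25`, a weight of 0.3 maps to level 1 (value 0.25), while −1.0 saturates at level −2 (value −0.5); per the quoted setup, the patch-embedding and classification/distillation layers would instead use `n_bits=8`.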
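The Table 1 quote contrasts noise injected at random positions with noise injected only into weights within the boundary range (BR), i.e. weights sitting close to a rounding boundary, where a small perturbation flips them to the adjacent quantization level and causes oscillation. A hedged sketch of how such a boundary distance could be measured; the exact criterion and threshold used in the paper may differ:

```python
def boundary_distance(weights, step):
    """Distance of each weight from its nearest rounding boundary,
    in units of the quantization step (assumed uniform quantizer).

    The offset of w/step from its nearest integer level lies in
    [-0.5, 0.5]; rounding boundaries sit at offset +/-0.5. The returned
    distance is 0.0 at a boundary (oscillation-prone) and 0.5 at a
    quantization level (stable). Illustrative, not the OFQ code.
    """
    distances = []
    for w in weights:
        offset = w / step - round(w / step)   # signed offset from nearest level
        distances.append(0.5 - abs(offset))   # gap to the nearest boundary
    return distances

def within_boundary_range(weights, step, margin):
    """Flag weights whose boundary distance falls below a chosen margin
    (the BR membership test; `margin` is a hypothetical parameter)."""
    return [d < margin for d in boundary_distance(weights, step)]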