Oscillation-free Quantization for Low-bit Vision Transformers
Authors: Shih-Yang Liu, Zechun Liu, Kwang-Ting Cheng
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that these proposed techniques successfully abate weight oscillation and consistently achieve substantial accuracy improvement on ImageNet. |
| Researcher Affiliation | Collaboration | Hong Kong University of Science and Technology; Reality Labs, Meta Inc. |
| Pseudocode | Yes | Algorithm 1 Confidence-Guided Annealing (an illustrative sketch of CGA follows below the table) |
| Open Source Code | Yes | Code and models are available at: https://github.com/nbasyl/OFQ. |
| Open Datasets | Yes | In this section, we evaluate our proposed methods on the DeiT-tiny, DeiT-small (Touvron et al., 2021) and Swin-tiny (Liu et al., 2021a) architectures on the ILSVRC12 ImageNet classification dataset (Krizhevsky et al., 2017). |
| Dataset Splits | Yes | Table 1: Noise injection analysis on quantized DeiT-T. The 1st, 6th, and 11th rows are the accuracies of the converged model before the noise injection. Random refers to random-position noise injection; within BR refers to injecting noise only to weights within the boundary range; % of weights refers to the fraction of the weights with noise injected. µ and σ denote the mean and variance over ten experiment trials. |
| Hardware Specification | No | The paper does not specify the hardware used for its experiments (exact GPU/CPU models, clock speeds, memory amounts, or other system details). |
| Software Dependencies | No | The GitHub link implies a PyTorch implementation, but the paper does not give version numbers for any key software components or libraries. |
| Experiment Setup | Yes | Our quantized models are trained for 300 epochs with knowledge distillation, using the corresponding full-precision models as the teacher models and as initialization. For quantized DeiT-T, 2-bit DeiT-S, and 2-bit Swin-T, the training setting follows that of DeiT (Touvron et al., 2021) but without mixup/cutmix (Zhang et al., 2018b; Yun et al., 2019) data augmentation. For 3-bit/4-bit quantized DeiT-S and Swin-T, we follow the training recipe in (Li et al., 2022a;b). The number of annealing epochs is set to 25 for fine-tuning the optimized model with CGA. We apply 8-bit quantization for the first (patch embedding) layer and the last (classification and distillation) layers, following (Esser et al., 2020; Li et al., 2022a;b). (A configuration sketch follows below the table.) |
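The paper's Algorithm 1 (Confidence-Guided Annealing) freezes high-confidence weights and lets the remaining boundary-adjacent weights settle during a short annealing phase. Below is a minimal PyTorch sketch of that idea; the function names, the `threshold` value, and the distance-to-boundary confidence measure are assumptions made for illustration, not the paper's exact formulation.

```python
# Hedged sketch of Confidence-Guided Annealing (CGA); not the authors' code.
import torch

def cga_freeze_mask(latent_w: torch.Tensor, scale: torch.Tensor,
                    threshold: float = 0.25) -> torch.Tensor:
    """True where a latent weight is 'confident', i.e. far from a
    quantization decision boundary (assumed confidence measure)."""
    w = latent_w / scale                        # weights in units of the step size
    dist = torch.abs(w - torch.floor(w) - 0.5)  # distance to nearest boundary, in [0, 0.5]
    return dist > threshold                     # far from boundary -> freeze

def cga_update(latent_w: torch.Tensor, scale: torch.Tensor,
               grad: torch.Tensor, lr: float,
               threshold: float = 0.25) -> torch.Tensor:
    """One annealing step: frozen (confident) weights receive no gradient."""
    active = ~cga_freeze_mask(latent_w, scale, threshold)
    return latent_w - lr * grad * active
```

During the 25 annealing epochs quoted above, an update of this form would replace the plain gradient step for the quantized layers, calming weights that would otherwise oscillate across a boundary.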
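For quick reference, the quoted experiment setup can be condensed into a configuration sketch; the dictionary keys are hypothetical names for this summary, not taken from the OFQ repository.

```python
# Illustrative summary of the training recipe quoted in the Experiment Setup row.
train_config = {
    "epochs": 300,                # training with knowledge distillation
    "teacher": "full-precision model (also used as initialization)",
    "mixup": False,               # disabled for DeiT-T and 2-bit DeiT-S / Swin-T
    "cutmix": False,
    "cga_annealing_epochs": 25,   # fine-tuning phase with Confidence-Guided Annealing
    "first_last_layer_bits": 8,   # patch embedding and classification/distillation heads
}
```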