Overcoming Oscillations in Quantization-Aware Training

Authors: Markus Nagel, Marios Fournarakis, Yelysei Bondarenko, Tijmen Blankevoort

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we delve deeper into the phenomenon of weight oscillations and show that it can lead to a significant accuracy degradation due to wrongly estimated batch-normalization statistics during inference and increased noise during training. ... Finally, we propose two novel QAT algorithms to overcome oscillations during training: oscillation dampening and iterative weight freezing. We demonstrate that our algorithms achieve state-of-the-art accuracy for low-bit (3 & 4 bits) weight and activation quantization of efficient architectures, such as MobileNetV2, MobileNetV3, and EfficientNet-lite on ImageNet.
Researcher Affiliation | Industry | Qualcomm AI Research, an initiative of Qualcomm Technologies, Inc. Correspondence to: Markus Nagel <markusn@qti.qualcomm.com>, Marios Fournarakis <mfournar@qti.qualcomm.com>.
Pseudocode | Yes | Algorithm 1: QAT with iterative weight freezing
Open Source Code | Yes | Our source code is available at https://github.com/qualcomm-ai-research/oscillations-qat.
Open Datasets | Yes | In this section, we evaluate the effectiveness of our proposed methods for overcoming oscillations and compare them against other QAT methods on ImageNet (Russakovsky et al., 2015).
Dataset Splits | Yes | We train for 20 epochs with only weight quantization for the ablation studies. For weight and activation quantization in section 5.3, we train all models for 90 epochs. ... Validation accuracy (%) on ImageNet before (pre-BN) and after BN re-estimation (post-BN) for networks trained using low-bit weight quantization (Esser et al., 2020) for 20 epochs.
Hardware Specification | No | The paper does not specify any particular hardware (GPU models, CPU models, or specific cloud instances) used for running the experiments.
Software Dependencies | No | The paper describes optimization methods (LSQ-type quantization, SGD) but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | We use SGD with a momentum of 0.9 and train using a cosine annealing learning-rate decay. We train for 20 epochs with only weight quantization for the ablation studies. For weight and activation quantization in section 5.3, we train all models for 90 epochs. Depending on the network and quantization bit-width we train with a learning rate of either 0.01 or 0.0033.
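
The Experiment Setup row quotes SGD with momentum 0.9, cosine annealing learning-rate decay, 20 or 90 training epochs, and a learning rate of 0.01 or 0.0033. The sketch below shows a minimal PyTorch configuration matching that description; the model choice, the specific learning rate, and the epoch count are placeholders for illustration, not values confirmed for any particular run in the paper.

```python
# Minimal sketch of the reported optimizer/schedule configuration
# (SGD, momentum 0.9, cosine annealing). Model, learning rate, and
# epoch count below are placeholders, not values tied to a specific run.
import torch
from torchvision.models import mobilenet_v2

model = mobilenet_v2()   # stand-in for the quantized network
epochs = 20              # 20 (weight-only ablations) or 90 (weight + activation runs)
lr = 0.01                # 0.01 or 0.0033 depending on network and bit-width

optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # train_one_epoch(model, optimizer, ...)  # QAT training loop goes here
    scheduler.step()
```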
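The Dataset Splits row reports validation accuracy before (pre-BN) and after (post-BN) BN re-estimation. As general background, batch-norm statistics are commonly re-estimated by resetting the running statistics and passing a number of training batches through the network in train mode without gradient updates. The sketch below illustrates that generic procedure; the batch count and data loader are placeholders, and this is not necessarily the exact recipe used in the paper.

```python
import torch

@torch.no_grad()
def reestimate_bn_stats(model, data_loader, num_batches=50):
    """Reset BN running statistics and re-estimate them from training data.

    Generic procedure: num_batches is a placeholder, not a value taken
    from the paper.
    """
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None  # use a cumulative average over all seen batches
    model.train()              # BN layers only update running stats in train mode
    for i, (images, _) in enumerate(data_loader):
        if i >= num_batches:
            break
        model(images)
    model.eval()
    return model
```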
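The Pseudocode row points to Algorithm 1, QAT with iterative weight freezing. To make the general idea concrete, the sketch below tracks an exponential moving average of how often each weight's integer (quantized) value changes and marks frequently changing weights as frozen. This is an assumption-laden illustration only: the class name, hyperparameters, and update rule are invented here, and the paper's oscillation metric and freezing schedule differ in detail (for one, this sketch counts any integer-value change rather than true direction-reversing oscillations). The released repository above contains the actual implementation.

```python
import torch

class OscillationTracker:
    """Illustrative sketch of tracking weight oscillations during QAT.

    Keeps an EMA of how often each weight's integer (quantized) value
    changes between iterations and marks weights whose change frequency
    exceeds a threshold as frozen. Hyperparameters and the update rule
    are assumptions, not the authors' Algorithm 1.
    """

    def __init__(self, int_weights, ema_decay=0.99, freeze_threshold=0.02):
        self.prev_int = int_weights.clone()
        self.freq = torch.zeros_like(int_weights, dtype=torch.float)
        self.frozen = torch.zeros_like(int_weights, dtype=torch.bool)
        self.ema_decay = ema_decay
        self.freeze_threshold = freeze_threshold

    def update(self, int_weights):
        # EMA of "did this weight's integer value change this step?"
        changed = (int_weights != self.prev_int).float()
        self.freq = self.ema_decay * self.freq + (1 - self.ema_decay) * changed
        self.prev_int = int_weights.clone()
        # Freezing is cumulative: once a weight is frozen it stays frozen.
        self.frozen |= self.freq > self.freeze_threshold
        return self.frozen
```

In a QAT loop, the returned mask would be used to keep the frozen weights fixed at their current quantized values, for example by zeroing their gradients or restoring them after each optimizer step.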