Overcoming Oscillations in Quantization-Aware Training

Authors: Markus Nagel, Marios Fournarakis, Yelysei Bondarenko, Tijmen Blankevoort

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we delve deeper into the phenomenon of weight oscillations and show that it can lead to a significant accuracy degradation due to wrongly estimated batch-normalization statistics during inference and increased noise during training. ... Finally, we propose two novel QAT algorithms to overcome oscillations during training: oscillation dampening and iterative weight freezing. We demonstrate that our algorithms achieve state-of-the-art accuracy for low-bit (3 & 4 bits) weight and activation quantization of efficient architectures, such as MobileNetV2, MobileNetV3, and EfficientNet-lite on ImageNet.
Researcher Affiliation | Industry | Qualcomm AI Research, an initiative of Qualcomm Technologies, Inc. Correspondence to: Markus Nagel <markusn@qti.qualcomm.com>, Marios Fournarakis <mfournar@qti.qualcomm.com>.
Pseudocode | Yes | Algorithm 1: QAT with iterative weight freezing
Open Source Code | Yes | Our source code is available at https://github.com/qualcomm-ai-research/oscillations-qat.
Open Datasets | Yes | In this section, we evaluate the effectiveness of our proposed methods for overcoming oscillations and compare them against other QAT methods on ImageNet (Russakovsky et al., 2015).
Dataset Splits | Yes | We train for 20 epochs with only weight quantization for the ablation studies. For weight and activation quantization in section 5.3, we train all models for 90 epochs. ... Validation accuracy (%) on ImageNet before (pre-BN) and after BN re-estimation (post-BN) for networks trained using low-bit weight quantization (Esser et al., 2020) for 20 epochs.
Hardware Specification | No | The paper does not specify any particular hardware (GPU models, CPU models, or specific cloud instances) used for running the experiments.
Software Dependencies | No | The paper describes optimization methods (LSQ-type quantization, SGD) but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | We use SGD with a momentum of 0.9 and train using a cosine annealing learning-rate decay. We train for 20 epochs with only weight quantization for the ablation studies. For weight and activation quantization in section 5.3, we train all models for 90 epochs. Depending on the network and quantization bit-width we train with a learning rate of either 0.01 or 0.0033.
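
The Experiment Setup row quotes SGD with momentum 0.9, cosine annealing learning-rate decay, 20 or 90 training epochs, and a learning rate of 0.01 or 0.0033. The sketch below shows a minimal PyTorch configuration matching that description; the model choice, the specific learning rate, and the epoch count are placeholders for illustration, not values confirmed for any particular run in the paper.

```python
# Minimal sketch of the reported optimizer/schedule configuration
# (SGD, momentum 0.9, cosine annealing). Model, learning rate, and
# epoch count below are placeholders, not values tied to a specific run.
import torch
from torchvision.models import mobilenet_v2

model = mobilenet_v2()   # stand-in for the quantized network
epochs = 20              # 20 (weight-only ablations) or 90 (weight + activation runs)
lr = 0.01                # 0.01 or 0.0033 depending on network and bit-width

optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # train_one_epoch(model, optimizer, ...)  # QAT training loop goes here
    scheduler.step()
```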
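The Dataset Splits row reports validation accuracy before (pre-BN) and after (post-BN) BN re-estimation. As general background, batch-norm statistics are commonly re-estimated by resetting the running statistics and passing a number of training batches through the network in train mode without gradient updates. The sketch below illustrates that generic procedure; the batch count and data loader are placeholders, and this is not necessarily the exact recipe used in the paper.

```python
import torch

@torch.no_grad()
def reestimate_bn_stats(model, data_loader, num_batches=50):
    """Reset BN running statistics and re-estimate them from training data.

    Generic procedure: num_batches is a placeholder, not a value taken
    from the paper.
    """
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None  # use a cumulative average over all seen batches
    model.train()              # BN layers only update running stats in train mode
    for i, (images, _) in enumerate(data_loader):
        if i >= num_batches:
            break
        model(images)
    model.eval()
    return model
```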
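The Pseudocode row points to Algorithm 1, QAT with iterative weight freezing. To make the general idea concrete, the sketch below tracks an exponential moving average of how often each weight's integer (quantized) value changes and marks frequently changing weights as frozen. This is an assumption-laden illustration only: the class name, hyperparameters, and update rule are invented here, and the paper's oscillation metric and freezing schedule differ in detail (for one, this sketch counts any integer-value change rather than true direction-reversing oscillations). The released repository above contains the actual implementation.

```python
import torch

class OscillationTracker:
    """Illustrative sketch of tracking weight oscillations during QAT.

    Keeps an EMA of how often each weight's integer (quantized) value
    changes between iterations and marks weights whose change frequency
    exceeds a threshold as frozen. Hyperparameters and the update rule
    are assumptions, not the authors' Algorithm 1.
    """

    def __init__(self, int_weights, ema_decay=0.99, freeze_threshold=0.02):
        self.prev_int = int_weights.clone()
        self.freq = torch.zeros_like(int_weights, dtype=torch.float)
        self.frozen = torch.zeros_like(int_weights, dtype=torch.bool)
        self.ema_decay = ema_decay
        self.freeze_threshold = freeze_threshold

    def update(self, int_weights):
        # EMA of "did this weight's integer value change this step?"
        changed = (int_weights != self.prev_int).float()
        self.freq = self.ema_decay * self.freq + (1 - self.ema_decay) * changed
        self.prev_int = int_weights.clone()
        # Freezing is cumulative: once a weight is frozen it stays frozen.
        self.frozen |= self.freq > self.freeze_threshold
        return self.frozen
```

In a QAT loop, the returned mask would be used to keep the frozen weights fixed at their current quantized values, for example by zeroing their gradients or restoring them after each optimizer step.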