Overcoming Oscillations in Quantization-Aware Training
Authors: Markus Nagel, Marios Fournarakis, Yelysei Bondarenko, Tijmen Blankevoort
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we delve deeper into the phenomenon of weight oscillations and show that it can lead to a significant accuracy degradation due to wrongly estimated batch-normalization statistics during inference and increased noise during training. ... Finally, we propose two novel QAT algorithms to overcome oscillations during training: oscillation dampening and iterative weight freezing. We demonstrate that our algorithms achieve state-of-the-art accuracy for low-bit (3 & 4 bits) weight and activation quantization of efficient architectures, such as MobileNetV2, MobileNetV3, and EfficientNet-lite on ImageNet. (An illustrative sketch of oscillation dampening follows the table.) |
| Researcher Affiliation | Industry | Qualcomm AI Research, an initiative of Qualcomm Technologies, Inc. Correspondence to: Markus Nagel <markusn@qti.qualcomm.com>, Marios Fournarakis <mfournar@qti.qualcomm.com>. |
| Pseudocode | Yes | Algorithm 1: QAT with iterative weight freezing (an illustrative sketch follows the table) |
| Open Source Code | Yes | Our source code is available at https://github.com/qualcomm-ai-research/oscillations-qat. |
| Open Datasets | Yes | In this section, we evaluate the effectiveness of our proposed methods for overcoming oscillations and compare them against other QAT methods on ImageNet (Russakovsky et al., 2015). |
| Dataset Splits | Yes | We train for 20 epochs with only weight quantization for the ablation studies. For weight and activation quantization in section 5.3, we train all models for 90 epochs. ... Validation accuracy (%) on ImageNet before (pre-BN) and after BN re-estimation (post-BN) for networks trained using low-bit weight quantization (Esser et al., 2020) for 20 epochs. (A BN re-estimation sketch follows the table.) |
| Hardware Specification | No | The paper does not specify any particular hardware (GPU models, CPU models, or specific cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper names its optimization and quantization methods (an LSQ-type quantizer, SGD), but does not provide version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We use SGD with a momentum of 0.9 and train using a cosine annealing learning-rate decay. We train for 20 epochs with only weight quantization for the ablation studies. For weight and activation quantization in section 5.3, we train all models for 90 epochs. Depending on the network and quantization bit-width we train with a learning rate of either 0.01 or 0.0033. (A training-loop sketch with these hyperparameters follows the table.) |
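The abstract quoted under Research Type names two proposed QAT algorithms, oscillation dampening and iterative weight freezing. Below is a minimal PyTorch-style sketch of a dampening-style regularizer, assuming the penalty is a squared distance between each latent weight and its (detached) quantized value, added to the task loss with a coefficient; the exact regularizer and quantizer used in the paper may differ.

```python
import torch

def quantize(w, scale, num_bits=4):
    """Symmetric uniform quantizer: round to the nearest grid point and clamp."""
    qmin, qmax = -2 ** (num_bits - 1), 2 ** (num_bits - 1) - 1
    w_int = torch.clamp(torch.round(w / scale), qmin, qmax)
    return w_int * scale

def dampening_loss(w_latent, scale, num_bits=4):
    """Squared distance between the latent weights and their quantized values.

    The quantized target is detached so the penalty only pulls the latent
    weights toward the quantization grid rather than moving the grid itself.
    """
    w_q = quantize(w_latent, scale, num_bits).detach()
    return torch.sum((w_latent - w_q) ** 2)

# Added to the task loss with an (assumed) dampening coefficient, e.g.:
# loss = task_loss + dampening_coeff * sum(dampening_loss(w, s) for w, s in quantized_weights)
```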
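The Pseudocode row points to Algorithm 1, QAT with iterative weight freezing. The sketch below shows one way such freezing could be implemented; the oscillation detector (an exponential moving average of integer-value flips), the threshold, and the class name are assumptions for illustration, not details taken from the excerpt.

```python
import torch

class OscillationTracker:
    """Tracks oscillations of the rounded integer weights and freezes weights
    that oscillate too often.

    Assumptions (illustrative, not from the excerpt): an oscillation is an
    integer-value change that reverses the previous change; its frequency is
    tracked with an exponential moving average; weights whose frequency
    exceeds `freeze_threshold` are pinned to their current quantized value.
    """

    def __init__(self, weight, scale, momentum=0.01, freeze_threshold=0.1):
        self.scale = scale
        self.momentum = momentum
        self.freeze_threshold = freeze_threshold
        self.prev_int = torch.round(weight.detach() / scale)
        self.prev_delta = torch.zeros_like(self.prev_int)
        self.freq = torch.zeros_like(self.prev_int)
        self.frozen = torch.zeros_like(self.prev_int, dtype=torch.bool)

    @torch.no_grad()
    def step(self, weight):
        """Call after each optimizer step; returns the boolean freeze mask."""
        w_int = torch.round(weight / self.scale)
        delta = w_int - self.prev_int
        # A flip relative to the previous integer change counts as an oscillation.
        oscillated = (delta * self.prev_delta) < 0
        self.freq = (1 - self.momentum) * self.freq + self.momentum * oscillated.float()
        self.prev_delta = torch.where(delta != 0, delta, self.prev_delta)
        self.prev_int = w_int
        self.frozen |= self.freq > self.freeze_threshold
        # Pin frozen latent weights to their quantized value (undoes their update).
        weight[self.frozen] = (w_int * self.scale)[self.frozen]
        return self.frozen
```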
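The Dataset Splits row quotes validation accuracy before and after BN re-estimation. A common recipe for re-estimating batch-normalization statistics is to reset the running statistics and recompute them with a number of forward passes over training data; the sketch below follows that recipe, with the batch count chosen for illustration rather than taken from the paper.

```python
import torch

@torch.no_grad()
def reestimate_bn_stats(model, data_loader, num_batches=50):
    """Re-estimate BatchNorm running statistics with the (quantized) weights.

    Resets every BN layer's running mean/variance, then runs `num_batches`
    forward passes in training mode so the statistics are recomputed.
    `num_batches` is an illustrative choice, not a value from the paper excerpt.
    """
    device = next(model.parameters()).device
    for module in model.modules():
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
            module.reset_running_stats()
            module.momentum = None  # use a cumulative moving average
    model.train()
    for i, (images, _) in enumerate(data_loader):
        if i >= num_batches:
            break
        model(images.to(device))
    model.eval()
```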
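The Experiment Setup row fully specifies the optimizer and schedule, so it translates directly into a training loop. The sketch below wires those quoted hyperparameters into standard PyTorch components; the `model` and `train_loader` arguments are placeholders, and the quantization machinery itself is omitted.

```python
import torch
import torch.nn.functional as F

def train_qat(model, train_loader, epochs=90, learning_rate=0.01):
    """Quoted setup: SGD with momentum 0.9 and cosine annealing LR decay;
    20 epochs for weight-only ablations, 90 epochs for weight-and-activation
    quantization; learning rate 0.01 or 0.0033 depending on network and bit-width."""
    device = next(model.parameters()).device
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(images.to(device)), targets.to(device))
            loss.backward()
            optimizer.step()
        scheduler.step()
```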