Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training

Authors: Charbel Sakr, Steve Dai, Rangharajan Venkatesan, Brian Zimmer, William Dally, Brucek Khailany

ICML 2022

Reproducibility assessment (variable | result | LLM response):

Research Type: Experimental
Experimentally, OCTAV-enabled QAT achieves state-of-the-art accuracy on multiple tasks. These include training-from-scratch and retraining ResNets and MobileNets on ImageNet, and SQuAD fine-tuning using BERT models, where OCTAV-enabled QAT consistently preserves accuracy at low precision (4-to-6 bits).

Researcher Affiliation: Industry
The authors are with NVIDIA Corporation, Santa Clara, CA 95051 USA. Correspondence to: Charbel Sakr <csakr@nvidia.com>.

Pseudocode: No
The paper provides mathematical formulas for its algorithm (e.g., equations 5 and 6) and describes its steps, but it does not present them in a clearly labeled "Pseudocode" or "Algorithm" block.

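For context, the algorithm the paper expresses in equations 5 and 6 is a Newton-Raphson-style recursion for the MSE-optimal clipping scalar. The NumPy sketch below reflects our reading of that recursion; the function name, initialization, and fixed iteration count are our assumptions, not the paper's specification, and should be checked against the paper itself.

```python
import numpy as np

def octav_clipping_scalar(x, num_bits, num_iters=20):
    """Sketch of an OCTAV-style Newton-Raphson recursion for the
    MSE-optimal clipping scalar (our reading of the paper's eq. 6)."""
    abs_x = np.abs(np.asarray(x).ravel())
    s = abs_x.mean()  # assumed initialization; the paper's choice may differ
    quant_noise = 4.0 ** (-num_bits) / 3.0  # B-bit uniform-quantization noise term
    for _ in range(num_iters):
        clipped = abs_x > s
        # Numerator: total magnitude of out-of-range (clipped) values.
        num = abs_x[clipped].sum()
        # Denominator: quantization-noise weight on in-range values
        # plus the count of clipped values.
        den = quant_noise * (~clipped).sum() + clipped.sum()
        s = num / den
    return s
```

For example, `octav_clipping_scalar(np.random.randn(100_000), num_bits=4)` would return a per-tensor clipping scalar that could then parameterize a uniform quantizer.
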
Open Source Code: No
The paper states: "Our implementations are derived from the NVIDIA Deep Learning Examples repository" (footnotes 3 and 4). This points to a general repository of examples; the paper does not state that the specific code for its methodology (OCTAV or MAD) is open-source, nor does it provide a direct link to such code.

Open Datasets: Yes
We evaluate training-from-scratch and retraining QAT using ResNet (He et al., 2016) and MobileNet (Sandler et al., 2018; Howard et al., 2019) models deployed on the ImageNet (Deng et al., 2009) dataset for image classification. For fine-tuning QAT, we use BERT (Devlin et al., 2018) language models pretrained on the Wikipedia (Wikimedia Foundation, 2021) and BookCorpus (Zhu et al., 2015) datasets and fine-tuned on SQuAD v1.1 (Rajpurkar et al., 2016) for question-answering.

Dataset Splits: No
While the paper uses well-known datasets (ImageNet, SQuAD) which typically have standard splits, it does not explicitly state the train/validation/test percentages or sample counts, nor does it cite the specific predefined splits used, in the main text.

Hardware Specification: Yes
All models were trained using momentum SGD on 8 V100 GPUs. For fine-tuning of BERT models on SQuAD, we used 1 V100 GPU. The calibration was done on an Intel Xeon CPU, using the NumPy package.

Software Dependencies: No
The paper mentions using PyTorch operations ("our implementation only invokes native PyTorch (Paszke et al., 2017) operations") and NumPy ("using the NumPy package"), but it does not specify the version numbers for these software components.

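Anyone reproducing the setup would therefore need to pin versions themselves; a trivial snippet (ours, not the paper's) for recording them is:

```python
# Record the otherwise-unspecified dependency versions for reproduction.
import numpy as np
import torch

print(f"torch=={torch.__version__}")  # e.g., copy into requirements.txt
print(f"numpy=={np.__version__}")
```
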
Experiment Setup: Yes
ResNet training used the following parameters: an initial learning rate of 0.1, a per-GPU batch size of 64, a momentum factor of 0.9, a weight decay factor of 1e-4, and a learning rate decay factor of 0.1 every 30 epochs. ResNet-50 and ResNet-18 were trained for 150 epochs, while ResNet-101 was trained for 80 epochs.

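For concreteness, here is a minimal sketch (ours, not the authors' released code) wiring the quoted hyperparameters into PyTorch's standard optimizer and scheduler APIs; `resnet50` stands in for any of the ResNet variants, and the OCTAV/MAD quantization logic itself is not shown:

```python
import torch
from torchvision.models import resnet50

model = resnet50(weights=None)  # placeholder architecture

# Quoted hyperparameters: momentum SGD with lr 0.1, momentum 0.9,
# weight decay 1e-4; per-GPU batch size of 64.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4
)
# Learning rate decays by a factor of 0.1 every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

epochs = 150  # ResNet-50/-18: 150 epochs; ResNet-101: 80 epochs
per_gpu_batch_size = 64
```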