Accumulation Bit-Width Scaling For Ultra-Low Precision Training Of Deep Networks

Authors: Charbel Sakr, Naigang Wang, Chia-Yu Chen, Jungwook Choi, Ankur Agrawal, Naresh Shanbhag, Kailash Gopalakrishnan

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply our analysis to three benchmark networks: CIFAR-10 ResNet 32, ImageNet ResNet 18 and ImageNet AlexNet. In each case, with accumulation precision set in accordance with our proposed equations, the networks successfully converge to the single-precision floating-point baseline. We also show that reducing accumulation precision further degrades the quality of the trained network, proving that our equations produce tight bounds. (See also Section 5, Numerical Results; a qualitative accumulation-precision sketch follows the table.)
Researcher Affiliation | Collaboration | Charbel Sakr, Naresh Shanbhag (Dept. of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, {sakr2,shanbhag}@illinois.edu); Naigang Wang, Chia-Yu Chen, Jungwook Choi, Ankur Agrawal, Kailash Gopalakrishnan (IBM T.J. Watson Research Center, {nwang,cchen,choij,ankuragr,kailash}@us.ibm.com)
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper does not provide any statement about making its source code available or a link to a repository.
Open Datasets | Yes | Using the above analysis, we predict the mantissa precisions required by the three GEMM functions for training the following networks: ResNet 32 on the CIFAR-10 dataset, ResNet 18 and AlexNet on the ImageNet dataset. Those benchmarks were chosen due to both their popularity and topologies which present large accumulation lengths, making them good candidates against which we can verify our work. We use the same configurations as (Wang et al., 2018)...
Dataset Splits | Yes | CIFAR-10 ResNet 32, ImageNet ResNet 18 and ImageNet AlexNet; the converged test error is close to the baseline (no more than 0.5% degradation). These datasets have standard, well-defined splits.
Hardware Specification | No | The paper does not provide specific details such as GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies | No | The paper mentions 'CUDA code' but does not specify any software names with version numbers, such as PyTorch, TensorFlow, or specific CUDA versions.
Experiment Setup | Yes | We use the same configurations as (Wang et al., 2018); in particular, we use 6-b of exponents in the accumulations, quantize the intermediate tensors to (1,5,2) floating-point format, and keep the final layer's precision in 16 bit. The technique of loss scaling (Micikevicius et al., 2017) is used in order to limit underflows of activation gradients. A single scaling factor of 1000 was used for all models tested. (A quantization and loss-scaling sketch follows the table.)
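
The claim quoted in the Research Type row is that an accumulation mantissa width chosen from the paper's bound preserves convergence, while narrower accumulators degrade the trained network. The NumPy sketch below is a loose, qualitative illustration of why long accumulations need wider mantissas (small addends get swamped once the running sum grows); it is not the paper's variance retention analysis, and the helper names, vector length, and bit-widths are illustrative assumptions.

```python
import numpy as np

def round_to_mantissa(x, m):
    """Round x to a floating-point value with an m-bit mantissa
    (unbounded exponent), by quantizing on the grid 2**(e - m)."""
    if x == 0.0:
        return 0.0
    e = np.floor(np.log2(abs(x)))
    ulp = 2.0 ** (e - m)
    return float(np.round(x / ulp) * ulp)

def low_precision_dot(a, b, m):
    """Dot product where every product and every partial sum is rounded
    to m mantissa bits, mimicking a reduced-precision accumulator."""
    acc = 0.0
    for ai, bi in zip(a, b):
        acc = round_to_mantissa(acc + round_to_mantissa(ai * bi, m), m)
    return acc

rng = np.random.default_rng(0)
n = 2048                       # accumulation length, e.g. a GEMM inner dimension
a = rng.standard_normal(n)
b = rng.standard_normal(n)
exact = float(np.dot(a, b))

for m in (8, 12, 16, 23):      # candidate accumulation mantissa widths
    approx = low_precision_dot(a, b, m)
    rel_err = abs(approx - exact) / abs(exact)
    print(f"m = {m:2d} mantissa bits: relative error = {rel_err:.2e}")
```

For a long random dot product, the reported relative error generally shrinks as the simulated accumulator mantissa m grows, mirroring the trend that the paper formalizes with its accumulation bit-width bound.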
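
The Experiment Setup row quotes the concrete configuration: (1,5,2) floating-point quantization of intermediate tensors, 16-bit precision for the final layer, and constant loss scaling by 1000. As a rough sketch of two of those ingredients (assuming round-to-nearest with saturation and no reserved NaN/Inf encodings, details the quoted text does not spell out), the snippet below simulates quantization to a (sign, exponent, mantissa) = (1, 5, 2) format and shows how a constant loss scale keeps a small gradient from underflowing; quantize_fp and the sample gradient values are hypothetical and do not reproduce the paper's CUDA kernels.

```python
import numpy as np

def quantize_fp(x, exp_bits=5, man_bits=2):
    """Round-to-nearest quantization to a (1, exp_bits, man_bits) float format,
    simulated on top of float64 (subnormals kept, overflow saturates);
    a sketch, not a bit-exact model of the hardware in the paper."""
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1
    mag = np.abs(x)
    e = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))    # per-element exponent
    e = np.clip(e, 1 - bias, bias)                        # normal exponent range
    ulp = 2.0 ** (e - man_bits)                           # grid spacing at that exponent
    q = np.round(x / ulp) * ulp
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** bias      # largest representable magnitude
    return np.where(mag > 0, np.clip(q, -max_val, max_val), 0.0)

# Constant loss scaling (a single factor of 1000, as in the quoted setup):
# scale the loss before backpropagation so small activation gradients stay
# above the (1,5,2) underflow level, then undo the scaling before the update.
LOSS_SCALE = 1000.0

g = np.array([6.0e-6, -2.7e-2, 0.9])              # hypothetical activation gradients
print(quantize_fp(g))                             # tiny gradient underflows to 0
print(quantize_fp(g * LOSS_SCALE) / LOSS_SCALE)   # it survives with loss scaling
```

The design point being illustrated is the one named in the quote: the scale factor is applied before quantization-aware backpropagation and divided out again, so the effective gradient magnitudes used for the weight update are unchanged while underflow in the low-precision format is avoided.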