Accumulation Bit-Width Scaling For Ultra-Low Precision Training Of Deep Networks

Authors: Charbel Sakr, Naigang Wang, Chia-Yu Chen, Jungwook Choi, Ankur Agrawal, Naresh Shanbhag, Kailash Gopalakrishnan

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply our analysis to three benchmark networks: CIFAR-10 ResNet 32, ImageNet ResNet 18 and ImageNet AlexNet. In each case, with accumulation precision set in accordance with our proposed equations, the networks successfully converge to the single-precision floating-point baseline. We also show that reducing accumulation precision further degrades the quality of the trained network, proving that our equations produce tight bounds. (See also Section 5, Numerical Results; a qualitative accumulation-precision sketch follows the table.)
Researcher Affiliation | Collaboration | Charbel Sakr, Naresh Shanbhag (Dept. of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, {sakr2,shanbhag}@illinois.edu); Naigang Wang, Chia-Yu Chen, Jungwook Choi, Ankur Agrawal, Kailash Gopalakrishnan (IBM T.J. Watson Research Center, {nwang,cchen,choij,ankuragr,kailash}@us.ibm.com)
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper does not provide any statement about making its source code available or a link to a repository.
Open Datasets | Yes | Using the above analysis, we predict the mantissa precisions required by the three GEMM functions for training the following networks: ResNet 32 on the CIFAR-10 dataset, ResNet 18 and AlexNet on the ImageNet dataset. Those benchmarks were chosen due to both their popularity and topologies which present large accumulation lengths, making them good candidates against which we can verify our work. We use the same configurations as (Wang et al., 2018)...
Dataset Splits | Yes | CIFAR-10 ResNet 32, ImageNet ResNet 18 and ImageNet AlexNet; the converged test error is close to the baseline (no more than 0.5% degradation). These datasets have standard, well-defined splits.
Hardware Specification | No | The paper does not provide specific details such as GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies | No | The paper mentions 'CUDA code' but does not specify any software names with version numbers, such as PyTorch, TensorFlow, or specific CUDA versions.
Experiment Setup | Yes | We use the same configurations as (Wang et al., 2018); in particular, we use 6-b of exponents in the accumulations, quantize the intermediate tensors to (1,5,2) floating-point format, and keep the final layer's precision in 16 bit. The technique of loss scaling (Micikevicius et al., 2017) is used in order to limit underflows of activation gradients. A single scaling factor of 1000 was used for all models tested. (A quantization and loss-scaling sketch follows the table.)
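
The claim quoted in the Research Type row is that an accumulation mantissa width chosen from the paper's bound preserves convergence, while narrower accumulators degrade the trained network. The NumPy sketch below is a loose, qualitative illustration of why long accumulations need wider mantissas (small addends get swamped once the running sum grows); it is not the paper's variance retention analysis, and the helper names, vector length, and bit-widths are illustrative assumptions.

```python
import numpy as np

def round_to_mantissa(x, m):
    """Round x to a floating-point value with an m-bit mantissa
    (unbounded exponent), by quantizing on the grid 2**(e - m)."""
    if x == 0.0:
        return 0.0
    e = np.floor(np.log2(abs(x)))
    ulp = 2.0 ** (e - m)
    return float(np.round(x / ulp) * ulp)

def low_precision_dot(a, b, m):
    """Dot product where every product and every partial sum is rounded
    to m mantissa bits, mimicking a reduced-precision accumulator."""
    acc = 0.0
    for ai, bi in zip(a, b):
        acc = round_to_mantissa(acc + round_to_mantissa(ai * bi, m), m)
    return acc

rng = np.random.default_rng(0)
n = 2048                       # accumulation length, e.g. a GEMM inner dimension
a = rng.standard_normal(n)
b = rng.standard_normal(n)
exact = float(np.dot(a, b))

for m in (8, 12, 16, 23):      # candidate accumulation mantissa widths
    approx = low_precision_dot(a, b, m)
    rel_err = abs(approx - exact) / abs(exact)
    print(f"m = {m:2d} mantissa bits: relative error = {rel_err:.2e}")
```

For a long random dot product, the reported relative error generally shrinks as the simulated accumulator mantissa m grows, mirroring the trend that the paper formalizes with its accumulation bit-width bound.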
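
The Experiment Setup row quotes the concrete configuration: (1,5,2) floating-point quantization of intermediate tensors, 16-bit precision for the final layer, and constant loss scaling by 1000. As a rough sketch of two of those ingredients (assuming round-to-nearest with saturation and no reserved NaN/Inf encodings, details the quoted text does not spell out), the snippet below simulates quantization to a (sign, exponent, mantissa) = (1, 5, 2) format and shows how a constant loss scale keeps a small gradient from underflowing; quantize_fp and the sample gradient values are hypothetical and do not reproduce the paper's CUDA kernels.

```python
import numpy as np

def quantize_fp(x, exp_bits=5, man_bits=2):
    """Round-to-nearest quantization to a (1, exp_bits, man_bits) float format,
    simulated on top of float64 (subnormals kept, overflow saturates);
    a sketch, not a bit-exact model of the hardware in the paper."""
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1
    mag = np.abs(x)
    e = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))    # per-element exponent
    e = np.clip(e, 1 - bias, bias)                        # normal exponent range
    ulp = 2.0 ** (e - man_bits)                           # grid spacing at that exponent
    q = np.round(x / ulp) * ulp
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** bias      # largest representable magnitude
    return np.where(mag > 0, np.clip(q, -max_val, max_val), 0.0)

# Constant loss scaling (a single factor of 1000, as in the quoted setup):
# scale the loss before backpropagation so small activation gradients stay
# above the (1,5,2) underflow level, then undo the scaling before the update.
LOSS_SCALE = 1000.0

g = np.array([6.0e-6, -2.7e-2, 0.9])              # hypothetical activation gradients
print(quantize_fp(g))                             # tiny gradient underflows to 0
print(quantize_fp(g * LOSS_SCALE) / LOSS_SCALE)   # it survives with loss scaling
```

The design point being illustrated is the one named in the quote: the scale factor is applied before quantization-aware backpropagation and divided out again, so the effective gradient magnitudes used for the weight update are unchanged while underflow in the low-precision format is avoided.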