Accumulation Bit-Width Scaling For Ultra-Low Precision Training Of Deep Networks
Authors: Charbel Sakr, Naigang Wang, Chia-Yu Chen, Jungwook Choi, Ankur Agrawal, Naresh Shanbhag, Kailash Gopalakrishnan
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our analysis to three benchmark networks: CIFAR-10 ResNet 32, ImageNet ResNet 18 and ImageNet AlexNet. In each case, with accumulation precision set in accordance with our proposed equations, the networks successfully converge to the single precision floating-point baseline. We also show that reducing accumulation precision further degrades the quality of the trained network, proving that our equations produce tight bounds. See also Section 5, NUMERICAL RESULTS. (A toy illustration of reduced-precision accumulation follows the table.) |
| Researcher Affiliation | Collaboration | Charbel Sakr, Naigang Wang, Chia-Yu Chen, Jungwook Choi, Ankur Agrawal, Naresh Shanbhag, Kailash Gopalakrishnan. Dept. of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign ({sakr2,shanbhag}@illinois.edu); IBM T.J. Watson Research Center ({nwang,cchen,choij,ankuragr,kailash}@us.ibm.com) |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | The paper does not provide any statement about making its source code available or a link to a repository. |
| Open Datasets | Yes | Using the above analysis, we predict the mantissa precisions required by the three GEMM functions for training the following networks: ResNet 32 on the CIFAR-10 dataset, ResNet 18 and AlexNet on the ImageNet dataset. Those benchmarks were chosen due to both their popularity and topologies which present large accumulation lengths, making them good candidates against which we can verify our work. We use the same configurations as (Wang et al., 2018)... |
| Dataset Splits | Yes | CIFAR-10 ResNet 32, ImageNet ResNet 18 and ImageNet AlexNet; the converged test error is close to the baseline (no more than 0.5% degradation). These datasets have standard, well-defined splits. |
| Hardware Specification | No | The paper does not provide specific details such as GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions 'CUDA code' but does not specify any software names with version numbers, such as PyTorch, TensorFlow, or specific CUDA versions. |
| Experiment Setup | Yes | We use the same configurations as (Wang et al., 2018), in particular, we use 6-b of exponents in the accumulations, and quantize the intermediate tensors to (1,5,2) floating-point format and keep the final layer's precision in 16 bit. The technique of loss scaling (Micikevicius et al., 2017) is used in order to limit underflows of activation gradients. A single scaling factor of 1000 was used for all models tested. (A sketch of this quantization and loss-scaling setup follows the table.) |
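
The (1,5,2) format and loss-scaling factor quoted in the Experiment Setup row can be made concrete with a short sketch. The Python snippet below is not code from the paper; `quantize_fp` is a hypothetical helper that rounds values to a 1-sign / 5-exponent / 2-mantissa floating-point grid (subnormals and special values are ignored for brevity), and the gradient values are made up. It only illustrates why multiplying the loss, and hence the gradients, by 1000 before quantization limits underflow.

```python
import numpy as np

def quantize_fp(x, exp_bits=5, man_bits=2):
    """Hypothetical helper (not from the paper): round an array to a
    (1, exp_bits, man_bits) floating-point grid with round-to-nearest.
    Subnormals and NaN/Inf handling are omitted for brevity."""
    exp_bias = 2 ** (exp_bits - 1) - 1             # IEEE-style exponent bias
    e_min = 1 - exp_bias                           # smallest normal exponent
    e_max = (2 ** exp_bits - 2) - exp_bias         # largest normal exponent

    x = np.asarray(x, dtype=np.float64)
    sign = np.sign(x)
    mag = np.abs(x)
    out = np.zeros_like(mag)

    nz = mag > 0
    e = np.clip(np.floor(np.log2(mag[nz])), e_min, e_max)
    step = 2.0 ** (e - man_bits)                   # grid spacing in each binade
    q = np.round(mag[nz] / step) * step
    max_val = (2.0 - 2.0 ** (-man_bits)) * 2.0 ** e_max
    out[nz] = np.minimum(q, max_val)               # saturate instead of overflow
    return sign * out


# Loss scaling as quoted above: multiply the loss (and hence the gradients)
# by 1000 before quantization so small activation gradients do not underflow,
# then undo the scaling afterwards.
loss_scale = 1000.0
toy_grads = np.array([2.3e-6, -8.1e-7, 4.4e-6])    # made-up gradient values

unscaled = quantize_fp(toy_grads)                   # everything rounds to zero
rescaled = quantize_fp(toy_grads * loss_scale) / loss_scale

print("without loss scaling:", unscaled)
print("with    loss scaling:", rescaled)
```

Under these assumptions, the unscaled toy gradients all round to zero, while the scaled-then-rescaled ones survive with a few bits of relative accuracy, which is the effect the quoted loss-scaling setup is meant to achieve.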
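
The Research Type row quotes the paper's claim that its accumulation-precision bounds are tight: lowering the accumulator mantissa width degrades training. The sketch below is not the paper's variance-retention analysis; it is only a toy demonstration of the swamping effect that motivates it, using hypothetical helpers `round_mantissa` and `accumulate` to round every partial sum of a long reduction to a chosen mantissa width.

```python
import numpy as np

def round_mantissa(x, man_bits):
    """Hypothetical helper: round a scalar to `man_bits` fractional mantissa
    bits with an unbounded exponent, a toy model of an accumulator register."""
    if x == 0.0:
        return 0.0
    e = np.floor(np.log2(abs(x)))
    step = 2.0 ** (e - man_bits)
    return float(np.round(x / step) * step)

def accumulate(values, man_bits):
    """Running sum in which every partial sum is rounded, so small addends
    get swamped once the accumulator grows large relative to its precision."""
    acc = 0.0
    for v in values:
        acc = round_mantissa(acc + v, man_bits)
    return acc

rng = np.random.default_rng(0)
n = 4096                                 # a large GEMM accumulation length
terms = rng.uniform(0.0, 1.0, size=n)    # positive addends of similar size

exact = terms.sum()
for m in (8, 12, 16, 23):
    approx = accumulate(terms, m)
    rel_err = abs(approx - exact) / exact
    print(f"accumulator mantissa {m:2d} bits: relative error {rel_err:.1e}")
```

With only a few mantissa bits the running sum stalls once its magnitude dwarfs the addends, and the relative error shrinks as the mantissa widens, mirroring qualitatively (not via the paper's equations) why larger accumulation lengths demand more accumulator precision.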