Ultra-Low Precision 4-bit Training of Deep Neural Networks

Authors: Xiao Sun, Naigang Wang, Chia-Yu Chen, Jiamin Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani, Kaoutar El Maghraoui, Vijayalakshmi (Viji) Srinivasan, Kailash Gopalakrishnan

NeurIPS 2020


Reproducibility Assessment (Variable, Result, LLM Response)

Research Type: Experimental
To demonstrate the robustness of the proposed 4-bit training scheme, we examined the impact of using INT4 weights and activations and FP4 gradients on a spectrum of computer vision models on the CIFAR10 [38] and ImageNet [39] datasets, as summarized in Tables 1 and 2 respectively. These emulation results were performed using a custom-modified PyTorch framework that implemented all of the precisions and schemes discussed in the paper (details in Appendix A).

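Since the published experiments rely on emulating INT4 weights and activations inside a standard FP32 PyTorch graph, a minimal sketch of what such an emulation layer can look like is given below. The symmetric per-tensor quantizer, the clipping range, and the straight-through estimator used here are generic assumptions for illustration, not the paper's exact quantization scheme.

# Sketch only: symmetric per-tensor INT4 "fake quantization" with a
# straight-through estimator, emulating 4-bit integer values inside an
# FP32 PyTorch graph. Rounding and clipping choices are assumptions,
# not the paper's exact quantizer.
import torch
import torch.nn.functional as F

class FakeQuantINT4(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        # INT4 two's-complement range is [-8, 7]
        q = torch.clamp(torch.round(x / scale), -8, 7)
        return q * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass the gradient through unchanged
        return grad_output, None

def quantize_int4(x):
    # Per-tensor scale so the largest magnitude maps near the INT4 limit
    scale = x.abs().max().clamp(min=1e-8) / 7.0
    return FakeQuantINT4.apply(x, scale)

# Example: emulate an INT4 x INT4 linear layer on FP32 hardware
weight = torch.randn(64, 128, requires_grad=True)
activation = torch.randn(32, 128)
output = F.linear(quantize_int4(activation), quantize_int4(weight))
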
Researcher Affiliation: Industry
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA

Pseudocode: Yes
Figure 3: GradScale, a per-layer trainable scaling factor: (a) GradScale Update Algorithm

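The paper's Figure 3(a) gives the actual GradScale update rule. As a loosely inspired sketch of the idea, a per-layer scaling factor can be nudged up or down so that scaled gradients stay within a limited dynamic range before 4-bit rounding; the range bound and the factor-of-two step below are illustrative assumptions, not the paper's constants.

# Loosely inspired sketch of a per-layer gradient scale update (see Figure 3(a)
# for the paper's actual GradScale algorithm). The range bound and the
# factor-of-two adjustment are illustrative assumptions.
import torch

def update_grad_scale(grad: torch.Tensor, scale: float, range_max: float = 64.0) -> float:
    """Adjust a per-layer scale so scaled gradients fit a narrow dynamic range."""
    max_mag = grad.abs().max().item() * scale
    if max_mag > range_max:          # largest gradient would saturate: shrink the scale
        scale *= 0.5
    elif max_mag < range_max * 0.5:  # unused headroom: grow the scale
        scale *= 2.0
    return scale
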
Open Source Code: No
The paper mentions using a 'custom-modified PyTorch framework' but does not state that the code for its specific methodology is open-sourced or otherwise available.

Open Datasets: Yes
To demonstrate the robustness of the proposed 4-bit training scheme, we examined the impact of using INT4 weights and activations and FP4 gradients on a spectrum of computer vision models on the CIFAR10 [38] and ImageNet [39] datasets, as summarized in Tables 1 and 2 respectively.

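Both datasets are publicly available; as a sketch, CIFAR-10 can be pulled directly through torchvision. The exact pre-processing pipeline used by the paper is not spelled out, so only a minimal transform is shown.

# Sketch: obtaining CIFAR-10 through torchvision. ImageNet requires a separate
# download and is not shown. The transform here is minimal and is not the
# paper's pre-processing pipeline.
import torchvision
import torchvision.transforms as T

transform = T.Compose([T.ToTensor()])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
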
Dataset Splits: No
The paper mentions using 'default network architectures' and datasets such as CIFAR10 and ImageNet, but does not explicitly specify the training, validation, and test splits (e.g., percentages or counts) needed for reproduction.

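For reference, the standard CIFAR-10 release ships with 50,000 training and 10,000 test images; whether a validation subset was carved out is not stated in the paper. Continuing from the torchvision sketch above, one assumed way to make such a split explicit is shown below; the 45,000/5,000 partition and the fixed seed are illustrative choices, not the paper's.

# Assumed example of an explicit train/validation split; the partition sizes
# and the fixed seed are illustrative, not from the paper.
import torch
from torch.utils.data import random_split

val_size = 5000
train_subset, val_subset = random_split(
    train_set,                                   # from the torchvision sketch above
    [len(train_set) - val_size, val_size],
    generator=torch.Generator().manual_seed(0),  # fixed seed so the split is reproducible
)
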
Hardware Specification: No
The paper discusses general hardware accelerators such as GPUs and TPUs and references hardware design costs, but it does not specify the exact models or configurations of GPUs, CPUs, or other hardware used to run the experiments (e.g., 'NVIDIA A100', 'Tesla V100').

Software Dependencies: No
The paper mentions using a 'custom-modified PyTorch framework' but does not provide version numbers for PyTorch or any other software libraries or dependencies used in the experiments.

Experiment Setup: No
The paper states, 'For all of these models, we used default network architectures, pre-processing techniques, hyper-parameters and optimizers with 4-bit training.' However, it does not provide concrete numerical values for these hyper-parameters (e.g., learning rate, batch size, number of epochs) or specific optimizer settings, which are needed to replicate the experimental setup.

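Because the paper defers to each model's defaults without listing them, the values below illustrate the kind of settings (common ResNet-style CIFAR-10 defaults) that a replication would need to pin down. They are assumptions for illustration only, not values confirmed by the paper.

# Common ResNet/CIFAR-10 defaults, shown only to indicate what a full
# experiment-setup description would include. None of these values are
# confirmed by the paper.
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=10)   # hypothetical model choice
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)
batch_size, num_epochs = 128, 160                      # illustrative, not from the paper
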