Ultra-Low Precision 4-bit Training of Deep Neural Networks
Authors: Xiao Sun, Naigang Wang, Chia-Yu Chen, Jiamin Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani, Kaoutar El Maghraoui, Vijayalakshmi (Viji) Srinivasan, Kailash Gopalakrishnan
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the robustness of the proposed 4-bit training scheme, we examined the impact of using INT4 weights and activations and FP4 gradients on a spectrum of computer vision models on the CIFAR10 [38] and ImageNet [39] datasets, as summarized in Tables 1 and 2 respectively. These emulation results were performed using a custom-modified PyTorch framework that implemented all of the precisions and schemes discussed in the paper (details in the Appendix-A). (A minimal INT4 emulation sketch follows the table.) |
| Researcher Affiliation | Industry | IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA |
| Pseudocode | Yes | Figure 3: GradScale, a per-layer trainable scaling factor: (a) GradScale Update Algorithm. (A hedged sketch of such a per-layer scale follows the table.) |
| Open Source Code | No | The paper mentions using a 'custom-modified PyTorch framework' but does not state that the code for their specific methodology is open-sourced or available. |
| Open Datasets | Yes | To demonstrate the robustness of the proposed 4-bit training scheme, we examined the impact of using INT4 weights and activations and FP4 gradients on a spectrum of computer vision models on the CIFAR10 [38] and ImageNet [39] datasets, as summarized in Tables 1 and 2 respectively. |
| Dataset Splits | No | The paper mentions using 'default network architectures' and datasets like CIFAR10 and ImageNet, but does not explicitly specify the exact training, validation, and test splits (e.g., percentages or counts) used for reproduction. |
| Hardware Specification | No | The paper discusses general hardware accelerators like GPUs and TPUs and references hardware design costs, but it does not specify the exact models or configurations of GPUs, CPUs, or other hardware used to run their experiments (e.g., 'NVIDIA A100', 'Tesla V100'). |
| Software Dependencies | No | The paper mentions using a 'custom-modified PyTorch framework', but does not provide specific version numbers for PyTorch or any other software libraries or dependencies used in their experiments. |
| Experiment Setup | No | The paper states, 'For all of these models, we used default network architectures, pre-processing techniques, hyper-parameters and optimizers with 4-bit training.' However, it does not provide concrete numerical values for these hyperparameters (e.g., learning rate, batch size, number of epochs) or specific optimizer settings, which are crucial for replicating the experimental setup. |
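
The Research Type row quotes the paper's description of emulating INT4 weights/activations and FP4 gradients inside a custom-modified PyTorch framework. That framework is not public, so the following is only a minimal sketch of how INT4 fake quantization of a tensor can be emulated in stock PyTorch; the per-tensor max-abs scale and round-to-nearest rounding are illustrative assumptions, not the paper's exact quantization scheme.

```python
import torch

def fake_quant_int4(x: torch.Tensor) -> torch.Tensor:
    """Emulate INT4 quantization: snap x onto a signed 4-bit grid
    (integer levels -8..7), then dequantize back to FP32 so the
    surrounding training math still runs in full precision.
    The per-tensor max-abs scaling and round-to-nearest used here
    are illustrative assumptions, not the paper's exact scheme."""
    qmin, qmax = -8.0, 7.0
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), qmin, qmax)
    return q * scale

# Example: emulate a 4-bit linear layer's forward matmul
w = torch.randn(128, 64)   # weights
a = torch.randn(32, 64)    # activations
out = fake_quant_int4(a) @ fake_quant_int4(w).t()
```

In an emulation setup like the one the paper describes, such a function would typically be applied to weights and activations in the forward pass while optimizer state and accumulations remain in higher precision.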
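
The Pseudocode row points to Figure 3's GradScale, a per-layer trainable scaling factor for keeping FP4 gradients in a representable range. The paper's exact update algorithm is given only in that figure; the sketch below illustrates the general idea with an assumed power-of-two update rule and an assumed FP4 magnitude limit, neither of which is taken from the paper.

```python
import torch

class GradScaleSketch:
    """Illustrative per-layer gradient scale: boost or shrink gradients so
    their magnitudes stay within an assumed FP4-representable range before
    quantization, then undo the scale afterward. The threshold values and
    power-of-two updates are assumptions, not Figure 3's exact algorithm."""

    def __init__(self, repr_max: float = 6.0):
        self.scale = 1.0            # per-layer scaling factor
        self.repr_max = repr_max    # assumed largest magnitude FP4 can hold

    def update(self, grad: torch.Tensor) -> None:
        gmax = (grad.abs().max() * self.scale).item()
        if gmax == 0.0:
            return
        if gmax > self.repr_max:            # scaled gradients would overflow
            self.scale *= 0.5
        elif gmax < 0.5 * self.repr_max:    # unused headroom: grow the scale
            self.scale *= 2.0

    def pre_quant(self, grad: torch.Tensor) -> torch.Tensor:
        return grad * self.scale            # apply before FP4 quantization

    def post_quant(self, grad: torch.Tensor) -> torch.Tensor:
        return grad / self.scale            # undo after dequantization
```

In a training loop, `update` would be called once per step from the layer's observed gradients, with `pre_quant`/`post_quant` wrapped around the FP4 quantize/dequantize step for that layer's gradient tensor.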