Neural gradients are near-lognormal: improved quantized and sparse training
Authors: Brian Chmiel, Liad Ben-Uri, Moran Shkolnik, Elad Hoffer, Ron Banner, Daniel Soudry
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Each method achieves state-of-the-art results on ImageNet. To the best of our knowledge, this paper is the first to (1) quantize the gradients to 6-bit floating-point formats, or (2) achieve up to 85% gradient sparsity, in each case without accuracy degradation. Reference implementation accompanies the paper in the supplementary material. |
| Researcher Affiliation | Collaboration | Habana Labs (an Intel company), Caesarea, Israel; Department of Electrical Engineering, Technion, Haifa, Israel |
| Pseudocode | Yes | Pseudo-code appears in Algorithm 1 |
| Open Source Code | Yes | Reference implementation accompanies the paper in the supplementary material. |
| Open Datasets | Yes | Each method achieves state-of-the-art results on ImageNet. ResNet18, ResNet101 on CIFAR-100; ResNet18, SqueezeNet on ImageNet. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009. |
| Dataset Splits | Yes | The validation accuracy during training for different sparsity levels and different datasets can be found in Fig. A.16. In Table 3 we show the results of different allocations between exponent and mantissa for different FP formats on the CIFAR-100 and ImageNet datasets. |
| Hardware Specification | No | The paper mentions "HW accelerator" but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for the experiments. |
| Software Dependencies | No | The paper discusses different floating-point formats and related work, but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | No | The paper mentions "All results were achieved using the suggested gradient scaling, where the mean is sampled once every epoch" but lacks comprehensive details on the experimental setup such as specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or optimizer settings. |
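
The FP-format result quoted above (Table 3 of the paper) concerns how a low-bit floating-point word is split between exponent and mantissa bits. As a rough illustration of what such an allocation means, the sketch below rounds a gradient tensor onto a simulated (exponent, mantissa) grid. This is a generic simulation written for this page, not the paper's FP6 implementation; the function name `quantize_fp`, the simplified symmetric exponent range, and the round-to-nearest mantissa are all assumptions.

```python
import torch


def quantize_fp(x: torch.Tensor, exp_bits: int = 5, man_bits: int = 2) -> torch.Tensor:
    """Round a tensor onto a simulated low-bit floating-point grid.

    Generic simulation of an (exponent, mantissa) split with a simplified
    exponent range; not the paper's exact FP6 definition.
    """
    sign = torch.sign(x)
    mag = x.abs().float()

    # Power-of-two exponent of each value (zeros are handled via the clamp).
    exp = torch.floor(torch.log2(mag.clamp(min=torch.finfo(torch.float32).tiny)))

    # Simplified representable exponent range for `exp_bits` exponent bits.
    max_exp = 2 ** (exp_bits - 1) - 1
    min_exp = -(2 ** (exp_bits - 1))
    exp = exp.clamp(min_exp, max_exp)

    # Round the mantissa to `man_bits` fractional bits.
    scale = 2.0 ** (exp - man_bits)
    q = torch.round(mag / scale) * scale

    # Saturate values that overflow the largest representable magnitude.
    max_val = (2.0 - 2.0 ** (-man_bits)) * 2.0 ** max_exp
    return sign * q.clamp(max=max_val)
```

For a 6-bit word with one sign bit, `quantize_fp(g, exp_bits=4, man_bits=1)` and `quantize_fp(g, exp_bits=3, man_bits=2)` correspond to two possible exponent/mantissa allocations, which is the kind of trade-off the paper reportedly compares across CIFAR-100 and ImageNet.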
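The 85% gradient sparsity claim can likewise be pictured with a generic unbiased stochastic-pruning step: entries below a threshold are either zeroed or pushed up to the threshold so the gradient is preserved in expectation. The paper presumably derives such thresholds from its lognormal fit of the gradient distribution; the sketch below simply takes the threshold as an argument, and `stochastic_prune` is a hypothetical helper, not the authors' code.

```python
import torch


def stochastic_prune(grad: torch.Tensor, threshold: float) -> torch.Tensor:
    """Unbiased stochastic pruning of small gradient entries.

    Entries with |g| >= threshold pass through unchanged. Smaller entries are
    kept (and pushed to +/- threshold) with probability |g| / threshold,
    otherwise zeroed, so the result equals the input in expectation.
    """
    mag = grad.abs()
    small = mag < threshold
    # Survival test for the small entries: larger ones survive more often.
    survive = torch.rand_like(grad) < mag / threshold
    out = torch.where(small & survive, torch.sign(grad) * threshold, grad)
    out = torch.where(small & ~survive, torch.zeros_like(grad), out)
    return out
```

With a threshold placed high in the magnitude distribution, most of the small entries are zeroed, giving a sparse gradient whose expectation still matches the dense one.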