Neural gradients are near-lognormal: improved quantized and sparse training

Authors: Brian Chmiel, Liad Ben-Uri, Moran Shkolnik, Elad Hoffer, Ron Banner, Daniel Soudry

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Each method achieves state-of-the-art results on ImageNet. To the best of our knowledge, this paper is the first to (1) quantize the gradients to 6-bit floating-point formats, or (2) achieve up to 85% gradient sparsity, in each case without accuracy degradation. A reference implementation accompanies the paper in the supplementary material. (A generic sketch of stochastic gradient pruning appears after this table.)
Researcher Affiliation | Collaboration | Habana Labs, an Intel company, Caesarea, Israel; Department of Electrical Engineering, Technion, Haifa, Israel
Pseudocode | Yes | Pseudo-code appears in Algorithm 1.
Open Source Code | Yes | A reference implementation accompanies the paper in the supplementary material.
Open Datasets | Yes | Each method achieves state-of-the-art results on ImageNet. Models and datasets: ResNet18 and ResNet101 on CIFAR-100; ResNet18 and SqueezeNet on ImageNet. Citation: ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.
Dataset Splits | Yes | The validation accuracy during training for different sparsity levels and different datasets can be found in Fig. A.16. In Table 3 we show the results of different allocations between exponent and mantissa for different FP formats on the CIFAR-100 and ImageNet datasets. (A sketch of the exponent/mantissa trade-off also appears after this table.)
Hardware Specification | No | The paper mentions an "HW accelerator" but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for the experiments.
Software Dependencies | No | The paper discusses different floating-point formats and related work, but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | No | The paper mentions that "All results were achieved using the suggested gradient scaling, where the mean is sampled once every epoch," but it lacks comprehensive details on the experimental setup, such as specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or optimizer settings.
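
For readers who want a feel for what "up to 85% gradient sparsity" means operationally, below is a minimal sketch of generic unbiased stochastic pruning, assuming PyTorch. It is not the authors' reference implementation (their threshold is, as we understand it, derived analytically from the lognormal fit of the gradient distribution); the function name stochastic_prune and the quantile-based threshold choice are assumptions made here purely for illustration.

import torch

def stochastic_prune(g: torch.Tensor, theta: float) -> torch.Tensor:
    """Generic unbiased stochastic pruning (illustration only).

    Entries with |g| >= theta are kept unchanged. Entries with |g| < theta are
    set to sign(g) * theta with probability |g| / theta and to 0 otherwise, so
    each output element equals the input in expectation.
    """
    mag = g.abs()
    below = mag < theta
    keep_prob = (mag / theta).clamp(max=1.0)            # promotion probability
    promote = torch.bernoulli(keep_prob).bool()         # Bernoulli trial per element
    pruned = torch.where(promote, torch.sign(g) * theta, torch.zeros_like(g))
    return torch.where(below, pruned, g)

# Toy usage: pick a magnitude-quantile threshold and measure the resulting sparsity.
g = torch.randn(10_000) * 1e-3                  # stand-in for a gradient tensor
theta = g.abs().quantile(0.85).item()           # entries below this become pruning candidates
sparse_g = stochastic_prune(g, theta)
print("sparsity:", (sparse_g == 0).float().mean().item())

Note that the achieved sparsity is lower than the 0.85 quantile suggests, because a fraction of the candidates is promoted to theta rather than zeroed; choosing the threshold to hit a target sparsity without trial and error is exactly where a distributional model of the gradients helps.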
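Similarly, the exponent/mantissa allocation question behind Table 3 can be illustrated with a simulated low-bit floating-point quantizer, again assuming PyTorch. This is a rough sketch, not the paper's method: the function name quantize_fp, round-to-nearest mantissa rounding, symmetric exponent clamping, and the absence of subnormal handling are all simplifications introduced here.

import torch

def quantize_fp(x: torch.Tensor, n_exp: int, n_man: int) -> torch.Tensor:
    """Simulate rounding x to a floating-point format with n_exp exponent bits
    and n_man mantissa bits (plus a sign bit). Illustration only."""
    e_max = 2 ** (n_exp - 1) - 1                        # largest representable exponent
    e_min = 1 - e_max                                   # smallest; values below it flush toward zero
    sign = torch.sign(x)
    mag = x.abs().clamp_min(torch.finfo(x.dtype).tiny)  # avoid log2(0)
    e = torch.floor(torch.log2(mag)).clamp(e_min, e_max)    # |x| = m * 2^e with m in [1, 2)
    m = mag / 2.0 ** e
    m_q = torch.round(m * 2 ** n_man) / 2 ** n_man      # keep n_man fractional mantissa bits
    return sign * m_q * 2.0 ** e

# Toy usage: compare two hypothetical 6-bit allocations on small "gradient-like" values.
g = torch.randn(10_000) * 1e-3
for n_exp, n_man in [(5, 0), (3, 2)]:
    err = (quantize_fp(g, n_exp, n_man) - g).abs().mean().item()
    print(f"1-{n_exp}-{n_man} format, mean abs error: {err:.2e}")

A comparison along these lines illustrates why dynamic range (exponent bits) tends to dominate for heavy-tailed, near-lognormal values: with a narrow exponent field, small gradients underflow to zero.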