FP8 Quantization: The Power of the Exponent

Authors: Andrey Kuzmin, Mart van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, Tijmen Blankevoort

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate the effect of the quantization formats on neural network quantization on three levels: 1) Analytically for several common data and weight distributions, 2) practically in INT8 and FP8 post-training quantization (PTQ) settings, and 3) in quantization-aware training (QAT) settings with both INT8 and different FP8 formats. We will show there is a strong agreement between our theoretical results and our practical results on real networks.
Researcher Affiliation | Industry | Qualcomm AI Research {akuzmin,mart,ren,markusn,jpeters,tijmen}@qti.qualcomm.com
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | Code will be made available at https://github.com/Qualcomm-AI-research/FP8-quantization
Open Datasets | Yes | We experiment on ResNet18 [19], MobileNetV2 [38], and ViT [14] for ImageNet classification [37]; BERT-base [12] for language understanding on the GLUE benchmark [43]; HRNet [39] for semantic segmentation on the Cityscapes dataset [10]; DeepLabV3 [7] for semantic segmentation on the Pascal VOC dataset [16]; and SalsaNext [11] for LIDAR point cloud segmentation on the SemanticKITTI dataset [2].
Dataset Splits | Yes | Following [35] we do not apply batch normalization folding, and re-estimate the batch normalization statistics (running mean and variance) before final validation, as this improved results for every model we considered.
Hardware Specification | Yes | Our code is written in PyTorch and all our experiments are performed using NVIDIA Tesla V100 and A100 GPUs.
Software Dependencies | No | The paper states "Our code is written in PyTorch" but does not specify a version number or other software dependencies with versions.
Experiment Setup | Yes | We train our models for 20 epochs and use Adam for the model parameters and SGD for the quantization parameters. We run experiments with various learning rates for model and quantization parameters, as well as per-tensor and per-channel quantization, and report results for the best learning setup.
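
The Research Type row above contrasts INT8 with several FP8 formats in PTQ and QAT. As a concrete illustration of what a simulated FP8 format with a configurable exponent/mantissa split looks like, the snippet below is a minimal quantize-dequantize sketch in PyTorch. It is not the authors' implementation (their code is at the GitHub link above); the function name, the default 4-bit-exponent/3-bit-mantissa split, and the decision to ignore NaN/Inf encodings are assumptions made here for clarity.

```python
import torch

def fp8_quantize_dequantize(x, n_mantissa=3, n_exponent=4, bias=None):
    """Simulated quantize-dequantize for an FP8-like format (1 sign bit,
    n_exponent exponent bits, n_mantissa mantissa bits).

    Minimal sketch, not the paper's code: values are clamped to the largest
    finite magnitude of the format and rounded to the nearest representable
    float; NaN/Inf encodings are ignored for simplicity.
    """
    if bias is None:
        bias = 2 ** (n_exponent - 1) - 1              # IEEE-style exponent bias
    max_exp = 2 ** n_exponent - 1 - bias              # largest unbiased exponent
    min_exp = 1 - bias                                # smallest normal exponent
    max_val = (2.0 ** max_exp) * (2.0 - 2.0 ** -n_mantissa)

    x = torch.clamp(x, -max_val, max_val)
    # Per-element exponent, floored at the subnormal range.
    exp = torch.clamp(torch.floor(torch.log2(x.abs() + 1e-30)), min=float(min_exp))
    scale = 2.0 ** (exp - n_mantissa)                 # quantization step at this exponent
    return torch.round(x / scale) * scale

# Example: quantize a random tensor to an assumed E4M3-style format.
x = torch.randn(8) * 10
print(fp8_quantize_dequantize(x))
```

Varying `n_exponent` while keeping 8 total bits shifts the trade-off between dynamic range and precision, which is the design axis the paper studies analytically and empirically.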
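
The Dataset Splits row quotes the paper's batch-normalization re-estimation step. A hypothetical helper like the one below shows the general recipe: run a number of training-mode forward passes so the BatchNorm running mean and variance are refreshed, then switch back to eval mode for validation. The loader format, batch count, and function name are assumptions, not details from the paper.

```python
import torch

@torch.no_grad()
def reestimate_bn_statistics(model, data_loader, num_batches=50):
    """Refresh BatchNorm running mean/variance with a few forward passes.

    Sketch only: num_batches and the (inputs, targets) loader format are
    assumptions; the paper does not specify these details.
    """
    model.train()                      # BN layers update running stats in train mode
    for i, (inputs, _) in enumerate(data_loader):
        model(inputs)                  # forward pass only; no gradients, no optimizer step
        if i + 1 >= num_batches:
            break
    model.eval()                       # freeze the refreshed statistics for validation
```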
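
The Experiment Setup row notes that model parameters are trained with Adam while quantization parameters use SGD. One way to wire that up in PyTorch is sketched below; the name-based split (parameters containing "quant") and the learning rates are placeholders, since the paper reports sweeping learning rates and selecting the best setup rather than a single configuration.

```python
import torch

def build_qat_optimizers(model, lr_model=1e-5, lr_quant=1e-3):
    """Two-optimizer QAT setup: Adam for model weights, SGD for quantizer parameters.

    Sketch under assumptions: quantizer parameters are identified by "quant"
    in their name, and both learning rates are placeholders, not values
    reported in the paper.
    """
    quant_params, model_params = [], []
    for name, param in model.named_parameters():
        (quant_params if "quant" in name else model_params).append(param)
    optimizer_model = torch.optim.Adam(model_params, lr=lr_model)
    optimizer_quant = torch.optim.SGD(quant_params, lr=lr_quant)
    return optimizer_model, optimizer_quant
```

In a training loop, both optimizers would be stepped and zeroed each iteration, so the weights and the quantization parameters are updated jointly but with their own optimizer and learning rate.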