Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

FP4 All the Way: Fully Quantized Training of Large Language Models

Authors: Brian Chmiel, Maxim Fishman, Ron Banner, Daniel Soudry

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 1T tokens. [...] We successfully train a 7-billion-parameter model on 256 Intel Gaudi2 accelerators. The resulting FP4-trained model achieves downstream task performance comparable to a standard BF16 baseline, confirming that FP4 training is a practical and highly efficient approach for large-scale LLM training.
Researcher Affiliation	Collaboration	Nvidia, Israel; Department of Electrical and Computer Engineering, Technion, Haifa, Israel
Pseudocode	No	The paper describes the mathematical operations for forward, backward, and update passes using equations (1) to (6) and provides theoretical derivations, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	A reference implementation is supplied in https://github.com/Anonymous1252022/fp4-all-the-way.
Open Datasets	Yes	We trained the models on the open-source Red Pajama dataset [5] for 1T tokens, maintaining hyperparameters consistent with [18], including train-test split and initialization.
Dataset Splits	Yes	We trained the models on the open-source Red Pajama dataset [5] for 1T tokens, maintaining hyperparameters consistent with [18], including train-test split and initialization.
Hardware Specification	Yes	All training was conducted on 256 Intel Gaudi2 devices, during 30 days.
Software Dependencies	No	Specifically, we used AdamW optimizer with β1 = 0.9, β2 = 0.95. We used cosine learning rate schedule, with 2000 steps of warmup, peak learning rate of 3 × 10⁻⁴ and decay to 0.1 of the peak learning rate. The paper specifies hyperparameters but not specific software library versions (e.g., PyTorch version, Python version, CUDA version) needed for replication.
Experiment Setup	Yes	Specifically, we used AdamW optimizer with β1 = 0.9, β2 = 0.95. We used cosine learning rate schedule, with 2000 steps of warmup, peak learning rate of 3 × 10⁻⁴ and decay to 0.1 of the peak learning rate. We used a global batch-size of 4M tokens.
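
The learning-rate schedule quoted in the Experiment Setup row can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the helper name `learning_rate`, the linear shape of the warmup, and the total step count (1T tokens divided by a 4M-token batch, reading "4M" as 4 × 10⁶) are assumptions that the quoted text does not confirm.

```python
import math

PEAK_LR = 3e-4          # peak learning rate quoted in the table
WARMUP_STEPS = 2000     # warmup length quoted in the table
FINAL_LR_RATIO = 0.1    # decay target: 0.1 of the peak learning rate
# Assumption: 1T training tokens / 4M tokens per global batch, with "4M" read as 4e6.
TOTAL_STEPS = 1_000_000_000_000 // 4_000_000


def learning_rate(step: int) -> float:
    """Hypothetical helper: learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Assumption: warmup rises linearly from 0 to the peak.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    # Cosine factor goes from 1 (end of warmup) down to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = PEAK_LR * FINAL_LR_RATIO
    return min_lr + (PEAK_LR - min_lr) * cosine
```

Under these assumptions the schedule reaches the peak at step 2000 and settles at one tenth of the peak by the final step. If the batch size is instead 2²² tokens, only `TOTAL_STEPS` changes.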