Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
FP4 All the Way: Fully Quantized Training of Large Language Models
Authors: Brian Chmiel, Maxim Fishman, Ron Banner, Daniel Soudry
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 1T tokens. [...] We successfully train a 7-billion-parameter model on 256 Intel Gaudi2 accelerators. The resulting FP4-trained model achieves downstream task performance comparable to a standard BF16 baseline, confirming that FP4 training is a practical and highly efficient approach for large-scale LLM training. |
| Researcher Affiliation | Collaboration | Nvidia, Israel; Department of Electrical and Computer Engineering, Technion, Haifa, Israel |
| Pseudocode | No | The paper describes the mathematical operations for forward, backward, and update passes using equations (1) to (6) and provides theoretical derivations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | A reference implementation is supplied in https://github.com/Anonymous1252022/fp4-all-the-way. |
| Open Datasets | Yes | We trained the models on the open-source Red Pajama dataset [5] for 1T tokens, maintaining hyperparameters consistent with [18], including train-test split and initialization. |
| Dataset Splits | Yes | We trained the models on the open-source Red Pajama dataset [5] for 1T tokens, maintaining hyperparameters consistent with [18], including train-test split and initialization. |
| Hardware Specification | Yes | All training was conducted on 256 Intel Gaudi2 devices, during 30 days. |
| Software Dependencies | No | Specifically, we used AdamW optimizer with β1 = 0.9, β2 = 0.95. We used cosine learning rate schedule, with 2000 steps of warmup, peak learning rate of 3 × 10⁻⁴ and decay to 0.1 of the peak learning rate. The paper specifies hyperparameters but not specific software library versions (e.g., PyTorch version, Python version, CUDA version) needed for replication. |
| Experiment Setup | Yes | Specifically, we used AdamW optimizer with β1 = 0.9, β2 = 0.95. We used cosine learning rate schedule, with 2000 steps of warmup, peak learning rate of 3 × 10⁻⁴ and decay to 0.1 of the peak learning rate. We used a global batch-size of 4M tokens. |
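
The learning rate schedule quoted in the Experiment Setup row can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code. The peak rate, warmup length, and final ratio come from the table. The linear shape of the warmup, the function name `learning_rate`, and the step count (1T tokens divided by 4M tokens per step, reading "1T" as 10^12 and "4M" as 4 × 10^6) are assumptions.

```python
import math

PEAK_LR = 3e-4        # peak learning rate (from the table)
WARMUP_STEPS = 2000   # warmup length in steps (from the table)
MIN_LR_RATIO = 0.1    # schedule decays to 0.1 of the peak (from the table)

# Assumption: "1T" tokens = 10**12 and "4M" tokens per batch = 4 * 10**6,
# which gives 250,000 optimizer steps. The paper may round differently.
TOTAL_STEPS = 10**12 // (4 * 10**6)


def learning_rate(step, total_steps=TOTAL_STEPS):
    """Return the learning rate at a given optimizer step (hypothetical helper)."""
    if step < WARMUP_STEPS:
        # Assumption: warmup is linear; the source only gives its length.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS))
    # Cosine factor goes from 1 (start of decay) to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine)


if __name__ == "__main__":
    for s in (0, 1000, 2000, 126_000, TOTAL_STEPS):
        print(f"step {s:>7}: lr = {learning_rate(s):.3e}")
```

Under these assumptions the rate climbs to the peak over the first 2000 steps and then follows a half cosine down to one tenth of the peak at the final step.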