Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Quartet: Native FP4 Training Can Be Optimal for Large Language Models

Authors: Roberto Castro, Andrei Panferov, Rush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, Dan Alistarh

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across bit-widths and training setups. Guided by this investigation, we design an optimal technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for Blackwell, demonstrating that fully FP4-based training is a competitive alternative to FP16 half-precision and to FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
Researcher Affiliation	Collaboration	Roberto L. Castro ISTA & Red Hat AI Andrei Panferov ISTA Soroush Tabesh ISTA Oliver Sieberling ETH Zürich Jiale Chen ISTA Mahdi Nikdan ISTA Saleh Ashkboos ETH Zürich Dan Alistarh ISTA & Red Hat AI
Pseudocode	Yes	A.1 Algorithm Algorithm 1 Quartet MXFP4 Forward-Backward Algorithm
Open Source Code	Yes	Our code is available at https://github.com/IST-DASLab/Quartet.
Open Datasets	Yes	We pre-train Transformers [51] of the Llama-2 [46] architecture in the range of 30, 50, 100, 200 million non-embedding parameters across a wide range of data-to-parameter ratios ranging from 25x (around compute-optimal [27]) to 800x (extreme data saturation). We additionally selectively scale the model size up to around 7 billion parameters to verify training stability. We train all models on the train split of the C4 [19] dataset and report C4 validation loss as the main metric.
Dataset Splits	Yes	We train all models on the train split of the C4 [19] dataset and report C4 validation loss as the main metric.
Hardware Specification	Yes	Our key technical contribution is a complex, highly-efficient GPU implementation of Quartet specialized to the new Blackwell architecture. ...on an NVIDIA Blackwell RTX 5090 GPU. The speedup results were obtained on a consumer-grade NVIDIA RTX5090 GPU with total runtime of under 1 hour. The pre-training experiments were conducted on datacenter-grade machines with 8x H100 NVIDIA GPUs for a total compute of around 6,000 GPU-hours.
Software Dependencies	Yes	Our fast implementation builds on CUTLASS 3.9 [45], which provides templates for the new Blackwell architecture.
Experiment Setup	Yes	We use the AdamW optimizer [32] with weight decay of 0.1, gradient clipping of 1.0, a 10% LR warmup and cosine schedule. We identify the optimal LR for one of the small unquantized baseline models, scale it inverse-proportionally to the number of non-embedding parameters and reuse for every quantization scheme we evaluate. We present all hyper-parameters in Appendix A.2. Table 4: Model-specific hyperparameters used in our experiments. Table 5: Common hyperparameters used across all model sizes and quantization setups.
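
The learning-rate recipe quoted in the Experiment Setup row (a 10% warmup, a cosine schedule, and a peak LR scaled inverse-proportionally to the non-embedding parameter count) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names `scaled_peak_lr` and `lr_at_step`, the linear shape of the warmup, the decay to a final LR of zero, and all numeric values in the usage lines are assumptions made for illustration; the actual hyper-parameters are the ones in the paper's Appendix A.2.

```python
import math


def scaled_peak_lr(base_lr: float, base_params: int, params: int) -> float:
    """Scale a tuned peak LR inverse-proportionally to model size.

    base_lr is the LR tuned on a baseline model with base_params
    non-embedding parameters (hypothetical values; not from the paper).
    """
    return base_lr * base_params / params


def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_frac: float = 0.10, final_lr: float = 0.0) -> float:
    """Warmup followed by cosine decay.

    warmup_frac=0.10 matches the quoted "10% LR warmup". The linear
    warmup shape and final_lr=0.0 are assumptions of this sketch.
    """
    warmup_steps = max(1, int(round(warmup_frac * total_steps)))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


# Usage with made-up numbers: an LR tuned on a 30M-parameter model,
# reused for a 100M-parameter model over 1000 optimizer steps.
peak = scaled_peak_lr(base_lr=1e-3, base_params=30_000_000, params=100_000_000)
schedule = [lr_at_step(s, total_steps=1000, peak_lr=peak) for s in range(1000)]
```

In a training loop, the value returned by `lr_at_step` would be assigned to the optimizer's learning rate before each update, alongside the quoted weight decay of 0.1 and gradient clipping of 1.0.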