Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Quartet: Native FP4 Training Can Be Optimal for Large Language Models

Authors: Roberto Castro, Andrei Panferov, Rush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, Dan Alistarh

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across bit-widths and training setups. Guided by this investigation, we design an optimal technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for Blackwell, demonstrating that fully FP4-based training is a competitive alternative to FP16 half-precision and to FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
Researcher Affiliation | Collaboration | Roberto L. Castro (ISTA & Red Hat AI); Andrei Panferov (ISTA); Soroush Tabesh (ISTA); Oliver Sieberling (ETH Zürich); Jiale Chen (ISTA); Mahdi Nikdan (ISTA); Saleh Ashkboos (ETH Zürich); Dan Alistarh (ISTA & Red Hat AI)
Pseudocode | Yes | A.1 Algorithm. Algorithm 1: Quartet MXFP4 Forward-Backward Algorithm
Open Source Code | Yes | Our code is available at https://github.com/IST-DASLab/Quartet.
Open Datasets | Yes | We pre-train Transformers [51] of the Llama-2 [46] architecture in the range of 30, 50, 100, 200 million non-embedding parameters across a wide range of data-to-parameter ratios ranging from 25x (around compute-optimal [27]) to 800x (extreme data saturation). We additionally selectively scale the model size up to around 7 billion parameters to verify training stability. We train all models on the train split of the C4 [19] dataset and report C4 validation loss as the main metric.
Dataset Splits | Yes | We train all models on the train split of the C4 [19] dataset and report C4 validation loss as the main metric.
Hardware Specification | Yes | Our key technical contribution is a complex, highly-efficient GPU implementation of Quartet specialized to the new Blackwell architecture. ...on an NVIDIA Blackwell RTX 5090 GPU. The speedup results were obtained on a consumer-grade NVIDIA RTX5090 GPU with total runtime of under 1 hour. The pre-training experiments were conducted on datacenter-grade machines with 8x H100 NVIDIA GPUs for a total compute of around 6,000 GPU-hours.
Software Dependencies | Yes | Our fast implementation builds on CUTLASS 3.9 [45], which provides templates for the new Blackwell architecture.
Experiment Setup | Yes | We use the AdamW optimizer [32] with weight decay of 0.1, gradient clipping of 1.0, a 10% LR warmup and cosine schedule. We identify the optimal LR for one of the small unquantized baseline models, scale it inverse-proportionally to the number of non-embedding parameters and reuse for every quantization scheme we evaluate. We present all hyper-parameters in Appendix A.2. Table 4: Model-specific hyperparameters used in our experiments. Table 5: Common hyperparameters used across all model sizes and quantization setups.
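
The learning-rate recipe and data-to-parameter ratios quoted above can be illustrated with a minimal Python sketch. It is not taken from the Quartet repository. The function names, the linear warmup shape, the decay-to-zero final LR, and the example base LR are assumptions made for illustration; the quoted text states only a 10% warmup, a cosine schedule, inverse-proportional LR scaling, and ratios from 25x to 800x.

```python
import math


def scaled_peak_lr(base_lr: float, base_params: int, params: int) -> float:
    """Scale a tuned peak LR inverse-proportionally to the non-embedding
    parameter count (base_lr and base_params are hypothetical inputs)."""
    return base_lr * base_params / params


def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_frac: float = 0.10, final_lr: float = 0.0) -> float:
    """Warmup over the first 10% of steps, then cosine decay.

    Assumptions: the warmup is linear and the schedule decays to final_lr=0;
    neither detail is given in the quoted text.
    """
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


def token_budget(params: int, ratio: int) -> int:
    """Training tokens implied by a data-to-parameter ratio (e.g. 25 or 800)."""
    return params * ratio


if __name__ == "__main__":
    # Hypothetical example: an LR tuned on a 30M model, reused for a 100M model.
    peak = scaled_peak_lr(base_lr=1e-3, base_params=30_000_000, params=100_000_000)
    print(f"peak LR for 100M model: {peak:.2e}")
    print(f"tokens at 25x for 30M model: {token_budget(30_000_000, 25):,}")
```

Under these assumptions, the smallest configuration (30 million non-embedding parameters at a 25x ratio) corresponds to 750 million training tokens, and the same model at 800x corresponds to 24 billion. The actual per-model values are those in the paper's Appendix A.2 (Tables 4 and 5).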