Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Towards Fully FP8 GEMM LLM Training at Scale

Authors: Alejandro Hernández Cano, Dhia Garbaya, Imanol Schlag, Martin Jaggi

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We perform extensive experiments to verify our architecture across several scales. We use the FineWeb-Edu [23] text corpus... Our hardware infrastructure consists of nodes with 4 Nvidia Grace Hopper GPUs each. Our distributed training framework is adapted from Megatron-LM [29], which uses Transformer Engine [2] FP8 recipes. ... We compare our proposals with the higher-precision Llama3 baseline across a wide range of standard benchmarks to measure their downstream performance.
Researcher Affiliation	Academia	Alejandro Hernández-Cano EPFL EMAIL Dhia Garbaya EPFL EMAIL Imanol Schlag ETHZ EMAIL Martin Jaggi EPFL EMAIL
Pseudocode	No	The paper includes architectural diagrams (Figure 1) and mathematical formulations (Section B Architectures), but does not present structured pseudocode blocks or algorithms with numbered steps.
Open Source Code	Yes	We make our implementation, along with detailed steps for our experiments, public under the repository https://github.com/anonymous4375934/FOG.
Open Datasets	Yes	We use the FineWeb-Edu [23] text corpus, filtering out any web opt-out domains with robots.txt, resulting in a rigorous data-compliant corpus [5].
Dataset Splits	No	The paper mentions 'We keep a consistent context length of 4096 during all main experiments.' and trains models for different 'token counts', but does not specify explicit training/validation/test dataset splits with percentages or sample counts for the data corpus used.
Hardware Specification	Yes	Our hardware infrastructure consists of nodes with 4 Nvidia Grace Hopper GPUs each.
Software Dependencies	No	Our distributed training framework is adapted from Megatron-LM [29], which uses Transformer Engine [2] FP8 recipes. While these are software names, no specific version numbers are provided.
Experiment Setup	Yes	We detail the selection of hyperparameters used in Table 6. For the case of FOG-flash, the α0 initialization value of tanhα entropy-regularization is 0.5 for all model sizes. All models use a linear warmup schedule, and 1-sqrt cooldown schedule. ... Hyperparameters by model size (390M / 1.5B / 8B; rows with fewer than three values are given as extracted): Layers (L): 16 / 16 / 32; Hidden size (D): 1024 / 2048 / 4096; FFN hidden size: 4096 / 8192 / 14336; Attention heads: 8 / 16 / 32; QK groups: 4 / 8 / 8; Softmax scale* (s): 0.17678, 0.125; Tied embeddings: Yes, No; Weight decay (λ): 0.1; AdamW β1: 0.9; AdamW β2: 0.95; Gradient clip value: 1.0; Context length T: 4096; Global batch size: 128 / 256 / 512; Total training steps: 100,000 / 125,000 / 10,000; Peak learning rate (η): 10^−3 / 2.5 × 10^−4 / 1.5 × 10^−4; Warmup η steps: 5,000 / 2,500 / 1,250; Cooldown η steps: 20,000 / 25,000 / N/A; Minimum η: 10^−8
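
The Experiment Setup row names a linear warmup followed by a 1-sqrt cooldown, but the quoted text does not give the formula. The sketch below is one plausible reading, not the authors' implementation: it assumes a constant plateau at the peak rate between warmup and cooldown, and a cooldown of the form η_min + (η_peak − η_min)(1 − sqrt(t)), where t is the fraction of the cooldown completed. The function name `learning_rate` and its parameters are hypothetical; the numeric values in the usage lines are those listed for the 390M model.

```python
import math


def learning_rate(step, peak_lr, min_lr, warmup_steps, cooldown_steps, total_steps):
    """Assumed schedule: linear warmup, constant plateau, 1-sqrt cooldown."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        # Plateau at the peak rate (an assumption, not stated in the source).
        return peak_lr
    # Fraction of the cooldown completed, clamped to [0, 1].
    frac = min(1.0, (step - cooldown_start) / cooldown_steps)
    # 1-sqrt decay from the peak rate down to the minimum rate.
    return min_lr + (peak_lr - min_lr) * (1.0 - math.sqrt(frac))


# Values listed for the 390M model in the Experiment Setup row.
cfg = dict(peak_lr=1e-3, min_lr=1e-8, warmup_steps=5_000,
           cooldown_steps=20_000, total_steps=100_000)

for s in (0, 2_500, 5_000, 80_000, 85_000, 100_000):
    print(s, learning_rate(s, **cfg))
```

Under these assumptions the rate falls fastest at the start of the cooldown: a quarter of the way through (step 85,000) it is already at roughly half the peak value.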