Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Towards Fully FP8 GEMM LLM Training at Scale

Authors: Alejandro Hernández Cano, Dhia Garbaya, Imanol Schlag, Martin Jaggi

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform extensive experiments to verify our architecture across several scales. We use the FineWeb-Edu [23] text corpus... Our hardware infrastructure consists of nodes with 4 Nvidia Grace Hopper GPUs each. Our distributed training framework is adapted from Megatron-LM [29], which uses Transformer Engine [2] FP8 recipes. ... We compare our proposals with the higher-precision Llama3 baseline across a wide range of standard benchmarks to measure their downstream performance.
Researcher Affiliation | Academia | Alejandro Hernández-Cano, EPFL, EMAIL; Dhia Garbaya, EPFL, EMAIL; Imanol Schlag, ETHZ, EMAIL; Martin Jaggi, EPFL, EMAIL
Pseudocode | No | The paper includes architectural diagrams (Figure 1) and mathematical formulations (Section B, Architectures), but does not present structured pseudocode blocks or algorithms with numbered steps.
Open Source Code | Yes | We make our implementation, along with detailed steps for our experiments, public under the repository https://github.com/anonymous4375934/FOG.
Open Datasets | Yes | We use the FineWeb-Edu [23] text corpus, filtering out any web opt-out domains with robots.txt, resulting in a rigorous data-compliant corpus [5].
Dataset Splits | No | The paper states 'We keep a consistent context length of 4096 during all main experiments.' and trains models for different 'token counts', but it does not specify explicit training/validation/test splits, with percentages or sample counts, for the data corpus used.
Hardware Specification | Yes | Our hardware infrastructure consists of nodes with 4 Nvidia Grace Hopper GPUs each.
Software Dependencies | No | Our distributed training framework is adapted from Megatron-LM [29], which uses Transformer Engine [2] FP8 recipes. While these are software names, no specific version numbers are provided.
Experiment Setup | Yes | We detail the selection of hyperparameters used in Table 6. For the case of FOG-flash, the α0 initialization value of tanhα entropy-regularization is 0.5 for all model sizes. All models use a linear warmup schedule, and 1-sqrt cooldown schedule. ...
Hyperparameter | 390M | 1.5B | 8B
Layers (L) | 16 | 16 | 32
Hidden size (D) | 1024 | 2048 | 4096
FFN hidden size | 4096 | 8192 | 14336
Attention heads | 8 | 16 | 32
QK groups | 4 | 8 | 8
Softmax scale* (s) | 0.17678 | 0.125
Tied embeddings | Yes | No
Weight decay (λ) | 0.1
AdamW β1 | 0.9
AdamW β2 | 0.95
Gradient clip value | 1.0
Context length T | 4096
Global batch size | 128 | 256 | 512
Total training steps | 100,000 | 125,000 | 10,000
Peak learning rate (η) | 10^-3 | 2.5 × 10^-4 | 1.5 × 10^-4
Warmup η steps | 5,000 | 2,500 | 1,250
Cooldown η steps | 20,000 | 25,000 | N/A
Minimum η | 10^-8
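The Experiment Setup entry names a linear warmup and a 1-sqrt cooldown but does not give the schedule in closed form. The sketch below is one plausible reading, not the authors' implementation. It assumes that warmup starts near zero, that the rate holds at its peak between warmup and cooldown, that the cooldown occupies the final steps of training, and that it decays as 1 - sqrt(progress) towards the minimum η. The function name `learning_rate` and its parameters are hypothetical.

```python
import math


def learning_rate(step, peak_lr, warmup_steps, cooldown_steps, total_steps, min_lr=1e-8):
    """Hypothetical schedule: linear warmup, constant plateau, 1-sqrt cooldown.

    Assumptions (not confirmed by the source): warmup starts near zero, the
    plateau sits at peak_lr, and the cooldown covers the last cooldown_steps.
    """
    if step < warmup_steps:
        # Linear warmup: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    cooldown_start = total_steps - cooldown_steps
    if cooldown_steps == 0 or step < cooldown_start:
        # Constant plateau at the peak learning rate.
        return peak_lr
    # 1-sqrt cooldown: scale factor goes from 1 down to 0 over the cooldown.
    progress = min((step - cooldown_start) / cooldown_steps, 1.0)
    return min_lr + (peak_lr - min_lr) * (1.0 - math.sqrt(progress))


# 390M column of the table: peak 1e-3, 5,000 warmup, 20,000 cooldown, 100,000 steps.
lr_390m = [learning_rate(s, 1e-3, 5_000, 20_000, 100_000)
           for s in (0, 4_999, 50_000, 85_000, 100_000)]
```

With these assumptions, a quarter of the way through the cooldown the rate has already dropped to about half of its peak, which is the characteristic front-loaded decay of a 1-sqrt shape.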