Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales

Authors: Shikai Qiu, Charlie Chen, Hoang Phan, Qi Lei, Andrew Gordon Wilson

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of optimizers, including Shampoo, SOAP, and Muon, accounting for the impact of commonly used techniques such as blocking and grafting. We find that scaling the learning rate according to µP improves transfer, but can still suffer from significant finite-width deviations that cause drifting optimal learning rates, which we show can be mitigated by blocking and explicit spectral normalization. For compute-optimal scaling, we find scaling independent weight decay as 1/width is nearly optimal across optimizers. Applying these scaling rules, we show Muon, SOAP and Shampoo consistently achieve near 1.4× speedup over AdamW for training Llama-architecture language models of sizes ranging from 190M to 1.4B, whereas the speedup vanishes rapidly with scale under incorrect scaling.
Researcher Affiliation | Academia | Shikai Qiu Zixi Chen Hoang Phan Qi Lei Andrew Gordon Wilson New York University. Correspondence to EMAIL, EMAIL, EMAIL.
Pseudocode | No | The paper defines update rules for optimizers like Adam, Shampoo, SOAP, Muon, AdaMuon, Grafting, and Blocking in Section A. These are presented as a series of mathematical equations rather than structured pseudocode or algorithm blocks with explicit control flow statements.
Open Source Code | Yes | We make our code and wandb experiment logs available here.
Open Datasets | Yes | We train transformers on the OpenWebText dataset for 100M tokens. We use the FineWeb dataset tokenized with the GPT-2 tokenizer, and train each model for 20 tokens per parameter.
Dataset Splits | No | The paper mentions training on 'randomly shuffled FineWeb' and using 'OpenWebText' but does not explicitly provide details about specific training, validation, or test dataset splits (e.g., percentages, sample counts, or predefined split references).
Hardware Specification | Yes | We trained all of our models on TPU-v4 and TPU-v6e, supported by the Google TPU Research Cloud program. OpenWebText experiments are trained on TPU-v4-4 and FineWeb experiments on TPU-v6e-8, TPU-v6e-16, and TPU-v4-32.
Software Dependencies | No | The paper mentions using a 'GPT-2 tokenizer' and 'wandb experiment logs', and references 'modded-nanogpt' for the Llama architecture. However, specific version numbers for these or other key software dependencies are not provided.
Experiment Setup | Yes | We use a linear decay learning rate schedule with no weight decay. We set ϵ = 10^-8 at the base model, except for Shampoo variants, which use relative ϵ = 10^-5. We use β1 = 0.9 (first moment) and β2 = 0.95 (second moment and preconditioners) for all optimizers. We use β2 = 0.98 for all experiments. Table 4 details tuning sweeps for learning rate, LR multiplier, β1, β2, warmup, and independent weight decay for each optimizer.
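The Research Type, Open Datasets, and Experiment Setup rows quote three rules that are easy to misread in prose: a budget of 20 training tokens per parameter, independent weight decay scaled as 1/width, and a linear decay learning rate schedule. The Python sketch below shows one way these could be computed. It is an illustration under stated assumptions, not the authors' code: the function names, the base width of 256, the base weight decay of 0.1, and the choice to decay the learning rate all the way to zero are hypothetical and do not come from the paper.

```python
def token_budget(num_params: int, tokens_per_param: int = 20) -> int:
    """Training tokens under the quoted 20-tokens-per-parameter rule."""
    return num_params * tokens_per_param


def scaled_weight_decay(base_wd: float, base_width: int, width: int) -> float:
    """Independent weight decay scaled as 1/width relative to a base model.

    base_wd and base_width are hypothetical tuning values for the base model.
    """
    return base_wd * base_width / width


def linear_decay_lr(peak_lr: float, step: int, total_steps: int,
                    warmup_steps: int = 0) -> float:
    """Optional linear warmup, then linear decay.

    Assumption: the rate decays to exactly zero at total_steps; the quoted
    text only says "linear decay" and does not give the final value.
    """
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


if __name__ == "__main__":
    # Model sizes quoted in the Research Type row: 190M and 1.4B parameters.
    for n in (190_000_000, 1_400_000_000):
        print(n, token_budget(n))
    # Quadrupling the width divides the weight decay by four.
    print(scaled_weight_decay(base_wd=0.1, base_width=256, width=1024))
    # Learning rate halfway through a 1000-step run with no warmup.
    print(linear_decay_lr(peak_lr=1e-3, step=500, total_steps=1000))
```

Under the 20-tokens-per-parameter rule, the 190M and 1.4B models correspond to 3.8B and 28B training tokens respectively.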