Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales
Authors: Shikai Qiu, Charlie Chen, Hoang Phan, Qi Lei, Andrew Gordon Wilson
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of optimizers, including Shampoo, SOAP, and Muon, accounting for the impact of commonly used techniques such as blocking and grafting. We find that scaling the learning rate according to µP improves transfer, but can still suffer from significant finite-width deviations that cause drifting optimal learning rates, which we show can be mitigated by blocking and explicit spectral normalization. For compute-optimal scaling, we find scaling independent weight decay as 1/width is nearly optimal across optimizers. Applying these scaling rules, we show Muon, SOAP and Shampoo consistently achieve near 1.4× speedup over AdamW for training Llama-architecture language models of sizes ranging from 190M to 1.4B, whereas the speedup vanishes rapidly with scale under incorrect scaling. |
| Researcher Affiliation | Academia | Shikai Qiu, Zixi Chen, Hoang Phan, Qi Lei, Andrew Gordon Wilson. New York University. Correspondence to EMAIL, EMAIL, EMAIL. |
| Pseudocode | No | The paper defines update rules for optimizers like Adam, Shampoo, SOAP, Muon, AdaMuon, Grafting, and Blocking in Section A. These are presented as a series of mathematical equations rather than structured pseudocode or algorithm blocks with explicit control flow statements. |
| Open Source Code | Yes | We make our code and wandb experiment logs available here. |
| Open Datasets | Yes | We train transformers on the OpenWebText dataset for 100M tokens. We use the FineWeb dataset tokenized with the GPT-2 tokenizer, and train each model for 20 tokens per parameter. |
| Dataset Splits | No | The paper mentions training on 'randomly shuffled FineWeb' and using 'OpenWebText' but does not explicitly provide details about specific training, validation, or test dataset splits (e.g., percentages, sample counts, or predefined split references). |
| Hardware Specification | Yes | We trained all of our models on TPU-v4 and TPU-v6e, supported by the Google TPU Research Cloud program. OpenWebText experiments are trained on TPU-v4-4 and FineWeb experiments on TPU-v6e-8, TPU-v6e-16, and TPU-v4-32. |
| Software Dependencies | No | The paper mentions using a 'GPT-2 tokenizer' and 'wandb experiment logs', and references 'modded-nanogpt' for the Llama architecture. However, specific version numbers for these or other key software dependencies are not provided. |
| Experiment Setup | Yes | We use a linear decay learning rate schedule with no weight decay. We set ϵ = 10^-8 at the base model, except for Shampoo variants, which use relative ϵ = 10^-5. We use β1 = 0.9 (first moment) and β2 = 0.95 (second moment and preconditioners) for all optimizers. We use β2 = 0.98 for all experiments. Table 4 details tuning sweeps for learning rate, LR multiplier, β1, β2, warmup, and independent weight decay for each optimizer. |
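
The rows above quote three quantitative rules: a training budget of 20 tokens per parameter, independent weight decay scaled as 1/width, and a linear decay learning rate schedule. The sketch below illustrates that arithmetic only. It is not the authors' code: the function names, the `base_width` reference point, the warmup handling, and the decay-to-zero endpoint are all assumptions made for illustration.

```python
def token_budget(n_params: int, tokens_per_param: int = 20) -> int:
    """Training tokens for a model, using the 20 tokens-per-parameter rule."""
    return n_params * tokens_per_param


def scaled_weight_decay(base_wd: float, base_width: int, width: int) -> float:
    """Scale independent weight decay as 1/width.

    Assumption: the decay is tuned at a reference `base_width` and then
    multiplied by base_width / width at larger widths.
    """
    return base_wd * base_width / width


def linear_decay_lr(step: int, total_steps: int, peak_lr: float,
                    warmup_steps: int = 0) -> float:
    """Linear warmup followed by linear decay.

    Assumption: the rate decays to zero at `total_steps`; the quoted text
    only says the schedule is a linear decay.
    """
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(total_steps - warmup_steps, 1)
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps
```

Under these assumptions, `token_budget(190_000_000)` gives 3.8B tokens for the smallest model size mentioned and `token_budget(1_400_000_000)` gives 28B for the largest, while doubling the width halves the weight decay.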