Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner

Authors: Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard E Turner, Hao-Jun Shi

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Section 3, we empirically demonstrate that Shampoo requires learning rate grafting in order to address the staleness and mis-scaling of the preconditioner's eigenvalues. Our empirical results show that the approximation error of the Kronecker factors evolves during training and impacts convergence, depending on both the training stage and parameter's properties. We ablate different variants of Shampoo on the Imagewoof dataset with vision transformer (ViT) and ConvNeXt V2 models. We plot the final training loss after 100 epochs against the learning rate. Following the specification of the AlgoPerf Shampoo submission, we update the preconditioner every 100 steps when grafting the learning rate from Adam, and the eigenbasis when using eigenvalue correction. Our results are presented in Figure 2.
Researcher Affiliation | Collaboration | (1) Department of Engineering, University of Cambridge; (2) Fundamental AI Research, Meta Superintelligence Labs, Meta Platforms, Inc.; (3) The Alan Turing Institute; (4) Infrastructure Optimizations, Meta Superintelligence Labs, Meta Platforms, Inc.
Pseudocode | Yes | In this section, we provide the pseudocode for all algorithms, including idealized (Algorithm 1) and practical (Algorithm 2) eigenvalue-corrected Shampoo, Shampoo with Adam grafting (Algorithm 3), and the adaptive warm-started QR algorithm (Algorithm 4).
Open Source Code | Yes | The implementation of EShampoo and all other Shampoo variants considered here including Algorithm 4 and Equation (11) for eigh is available at https://github.com/facebookresearch/optimizers.
Open Datasets | Yes | We use the Imagewoof dataset. The Imagewoof dataset is available at https://github.com/fastai/imagenette. The FastMRI dataset can be attributed to Knoll et al. (2020); Zbontar et al. (2019), the ImageNet dataset to Krizhevsky et al. (2012), and the OGBG dataset to Hu et al. (2021).
Dataset Splits | No | We follow the standard AlgoPerf setup and consider wall-clock time to pre-specified validation metric targets. See Dahl et al. (2023) and Kasimbeg et al. (2025) for more details on the AlgoPerf benchmark. We use the Imagewoof dataset.
Hardware Specification | Yes | For all experiments we used 1 NVIDIA A100 80GB GPU per run, with the exception of the ImageNet ViT experiments, for which we used 4 NVIDIA A100 80GB GPUs per run.
Software Dependencies | No | [...] computes eigendecompositions with torch.linalg.eigh (shortened as eigh) whenever Equation (11) does not hold. The implementation of EShampoo and all other Shampoo variants considered here including Algorithm 4 and Equation (11) for eigh is available at https://github.com/facebookresearch/optimizers.
Experiment Setup | Yes | All models are trained with cross entropy loss for 100 epochs, using a learning rate schedule consisting of a linear warmup for 353 steps followed by cosine decay. We use a batch size of 128, randomized cropping and horizontal flips as data augmentation, and the default settings for β1 = 0.9 and β2 = 0.999. For EShampoo in Figure 3, Figure 4, and Figure 5 we use the learning rate α = 6 × 10⁻⁴ and ϵ = 10⁻¹⁰.
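
The learning rate schedule quoted under Experiment Setup (linear warmup for 353 steps, then cosine decay) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `lr_at_step`, the total step count of 7000, the decay floor of 0, and the exact warmup ramp are hypothetical choices made for illustration; the peak value assumes the quoted EShampoo learning rate reads as 6 × 10⁻⁴.

```python
import math


def lr_at_step(step, peak_lr=6e-4, warmup_steps=353, total_steps=7000, final_lr=0.0):
    """Learning rate at a 0-indexed optimizer step.

    Only warmup_steps=353 comes from the quoted setup. peak_lr assumes the
    quoted value is 6e-4; total_steps and final_lr are hypothetical.
    """
    if step < warmup_steps:
        # Linear warmup: ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to final_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at final_lr past total_steps
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the schedule at a few representative steps.
    for s in (0, 176, 352, 353, 3500, 6999):
        print(s, lr_at_step(s))
```

The schedule rises linearly to its peak at the last warmup step and then falls monotonically. In a PyTorch training loop, the same shape would typically be obtained by passing such a function (divided by the peak value) to a lambda-based scheduler, though whether the authors did so is not stated in the quoted text.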