Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Enhancing Optimizer Stability: Momentum Adaptation of The NGN Step-size

Authors: Rustem Islamov, Niccolò Ajroldi, Antonio Orvieto, Aurelien Lucchi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental The theoretical results are supported by extensive empirical validation in various deep learning settings where we demonstrate that NGN-M and NGN-MD not only preserve the robustness property of the NGN step-size, but improve it further in many cases. LR hyperparameter resilience comes together with better or comparable performance to state-of-the-art algorithms.
Researcher Affiliation Academia Rustem Islamov (1), Niccolò Ajroldi (2), Antonio Orvieto (2, 3, 4), Aurelien Lucchi (1). (1) University of Basel; (2) Max Planck Institute for Intelligent Systems; (3) ELLIS Institute Tübingen; (4) Tübingen AI Center
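Pseudocode Yes Algorithm 1 NGN-M. 1: Input: $x^{-1} = x^0 \in \mathbb{R}^d$, step-size hyperparameter $c > 0$, momentum parameter $\beta \in [0, 1)$. 2: for $k = 0, 1, \dots, K-1$ do. 3: Sample $S_k \subseteq [n]$. 4: $\gamma_k = \dfrac{c}{1 + \frac{c}{2 f_{S_k}(x^k)} \|\nabla f_{S_k}(x^k)\|^2}$. 5: $x^{k+1} = x^k - (1-\beta)\gamma_k \nabla f_{S_k}(x^k) + \beta (x^k - x^{k-1})$. 6: end for.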
Open Source Code Yes We provide the PyTorch-based implementation in the supplementary.
Open Datasets Yes The tests include the training of Resnet20 [29] and ViT [17] on the CIFAR10 dataset [45], and Resnet110 on CIFAR100. Second, we test the performance of NGN-MD against Adam and Momo-Adam, which, contrary to NGN-M, both use component-wise preconditioning.
Dataset Splits Yes We report validation perplexity on a separate subset of Slim-Pajama consisting of 10M tokens. The total compute is estimated following Kaplan et al. [41], where the estimated number of floating-point operations (FLOPs) is 6 × Number of Parameters × Number of Tokens.
Hardware Specification Yes Experiments of small and middle size are performed on 1x RTX 4090. We perform ImageNet32 experiments on 2x A100-40GB, and ImageNet1k experiments on 4x A100-SXM4-40GB. For pretraining Transformers on Language Modeling, we employ 8x H100-HBM3-80GB GPUs.
Software Dependencies No We use the PyTorch [69] implementation of Adam. The implementations of MomSPS, Momo, and Momo-Adam are provided in the corresponding papers.
Experiment Setup Yes The detailed experiment setup, including the choice of hyperparameters as well as additional experimental results and details, can be found in Appendix I. The best performance of algorithms is reported in Tables 7 (momentum-based algorithms), 8 (algorithms with momentum and component-wise step-size), and 9 (algorithms with component-wise step-size). For clarity and quick reference, all links to the paper's empirical results are summarized in Table 6, while Appendix I provides additional details about the training and tokenization pipeline.
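
The NGN-M pseudocode quoted in the Pseudocode row can be illustrated with a small pure-Python sketch. This is not the authors' PyTorch implementation from the supplementary. The function names `ngn_step_size` and `ngn_m`, the `loss_and_grad` callback signature, the default values, and the toy quadratic are all assumptions made for illustration. The step-size is computed in the algebraically equivalent form 2cf / (2f + c‖∇f‖²) so that a zero loss does not cause a division by zero; that rewrite is also a choice made here, not something stated in the quoted text.

```python
import random


def ngn_step_size(loss, grad_sq_norm, c):
    """NGN step-size c / (1 + c * ||g||^2 / (2 * f)) for a non-negative loss f.

    Rewritten as 2*c*f / (2*f + c*||g||^2) to avoid dividing by f when f == 0
    (an implementation choice of this sketch).
    """
    denom = 2.0 * loss + c * grad_sq_norm
    if denom == 0.0:
        # Loss and gradient are both zero: the update is zero for any step-size.
        return c
    return 2.0 * c * loss / denom


def ngn_m(loss_and_grad, x0, n, c=1.0, beta=0.9, num_steps=100,
          batch_size=1, seed=0):
    """Sketch of Algorithm 1 (NGN-M) on plain Python lists.

    loss_and_grad(x, batch) is a hypothetical callback returning
    (f_S(x), grad f_S(x)) for the sampled index list `batch`, with f_S(x) >= 0.
    """
    rng = random.Random(seed)
    x_prev = list(x0)  # x^{-1} = x^0
    x = list(x0)
    for _ in range(num_steps):
        batch = rng.sample(range(n), batch_size)  # S_k, a subset of [n]
        loss, grad = loss_and_grad(x, batch)
        gamma = ngn_step_size(loss, sum(g * g for g in grad), c)
        # x^{k+1} = x^k - (1 - beta) * gamma_k * grad + beta * (x^k - x^{k-1})
        x_next = [
            xi - (1.0 - beta) * gamma * gi + beta * (xi - xpi)
            for xi, gi, xpi in zip(x, grad, x_prev)
        ]
        x_prev, x = x, x_next
    return x


def toy_quadratic(x, batch):
    """Toy non-negative loss f(x) = 0.5 * (x - 1)^2; the batch is ignored."""
    err = x[0] - 1.0
    return 0.5 * err * err, [err]


if __name__ == "__main__":
    print(ngn_m(toy_quadratic, [0.0], n=1, c=1.0, beta=0.5, num_steps=200))
```

Because the step-size never exceeds `c`, a large `c` is damped automatically whenever the gradient is large relative to the loss, which is the robustness property the Research Type row refers to.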
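
The compute estimate mentioned in the Dataset Splits row follows the rule of thumb from Kaplan et al. that training FLOPs are roughly 6 times the number of parameters times the number of tokens. A minimal sketch of that arithmetic is shown here. The function name and the example model and token counts are hypothetical and are not taken from the paper.

```python
def estimate_training_flops(num_parameters: int, num_tokens: int) -> int:
    """Approximate training compute as 6 * parameters * tokens."""
    return 6 * num_parameters * num_tokens


if __name__ == "__main__":
    # Hypothetical sizes chosen only to illustrate the arithmetic.
    flops = estimate_training_flops(124_000_000, 2_500_000_000)
    print(f"{flops:.3e} FLOPs")
```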