Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
In Search of Adam’s Secret Sauce
Authors: Antonio Orvieto, Robert Gower
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we conduct an extensive empirical study training over 1,500 language models across different data configurations and scales comparing Adam to several known simplified variants. We perform a large-scale evaluation ( 10 thousand NVIDIA A100-SXM4-80GB GPU hours) of the performance of established algorithms which claim a theoretical or empirical similarity/dissimilarity with Adam on 160M parameters LMs with usual configurations [Biderman et al., 2023, Black et al., 2022], at a compute-optimal budget on different datasets, at different batch-sizes and sequence lengths (up to 2048 tokens). |
| Researcher Affiliation | Academia | Antonio Orvieto: ELLIS Institute Tübingen, MPI-IS Tübingen AI Center, Germany. Robert M. Gower: CCM, Flatiron Institute, Simons Foundation, New York, US. |
| Pseudocode | No | The paper describes mathematical formulations of algorithms like Adam and Signum, and a theoretical interpretation of Adam, but it does not present any formal pseudocode or algorithm blocks describing its methodology. |
| Open Source Code | Yes | We make all of our data, e.g. loss dynamics for all our settings, publicly available at https://github.com/aorvieto/SecretSauce. (From NeurIPS Paper Checklist 5. Open access to data and code): 'We provide the code for reproducing our plots. We provide the data and main plots at https://github.com/aorvieto/SecretSauce.' |
| Open Datasets | Yes | We conduct 475 compute-optimal pretraining runs on the SlimPajama-627B dataset [Soboleva et al., 2023], using a sequence length of 2048, a batch size of 256, and a decoupled weight decay of 0.1 [Loshchilov and Hutter, 2019] (except for SGD). In Figure 18 we test both Takeaway 1 and Takeaway 2 on Fineweb [Penedo et al., 2024]. |
| Dataset Splits | Yes | We always report validation perplexity on a held-out subset of 100M tokens. We conduct 475 compute-optimal pretraining runs on the SlimPajama-627B dataset [Soboleva et al., 2023], using a sequence length of 2048, a batch size of 256, and a decoupled weight decay of 0.1 [Loshchilov and Hutter, 2019] (except for SGD). |
| Hardware Specification | Yes | We perform a large-scale evaluation ( 10 thousand NVIDIA A100-SXM4-80GB GPU hours) of the performance of established algorithms... All our experiments at a 160M parameter scale are performed on a single NVIDIA A100-SXM4-80GB. Our runs at a 410M parameter scale are performed on 8 NVIDIA A100-SXM4-80GB GPUs, and each run here takes approximately 4.83 hours. |
| Software Dependencies | No | For pre-training Transformers on Causal Language Modeling, we build upon the nanoGPT [Karpathy, 2022] implementation enhanced by recent advancements such as Rotational Positional Embeddings [Su et al., 2024], RMSNorm normalization [Zhang and Sennrich, 2019], and SwiGLU activation functions [Shazeer, 2020]. All our models have a vocabulary size of 50280 and make use of GPT-NeoX tokenizer [Black et al., 2022]. |
| Experiment Setup | Yes | We adopt a robust training protocol inspired by successful practices established in large language models like LLaMa [Touvron et al., 2023]... leveraging techniques including bfloat16 precision, linear warm-up followed by a cosine annealing schedule [Loshchilov and Hutter, 2016], and global gradient norm clipping (unless specified). Our model configurations follow [Biderman et al., 2023] and are presented, alongside a detailed description of all tuning settings and resources, in A. AdamW (200 runs): Tuned parameters include both momentum terms and the learning rate. (η, β1, β2) ∈ [0.016, 0.008, 0.004, 0.002, 0.001] × [0.9875, 0.975, 0.95, 0.9, 0.8] × [0.996875, 0.99375, 0.9875, 0.975, 0.95, 0.9, 0.8, 0.6] |
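
The Experiment Setup row quotes three value lists for (η, β1, β2) and a total of 200 runs. The minimal sketch below assumes the sweep is a full Cartesian product of those lists, which is consistent with the count (5 × 5 × 8 = 200) but is not stated in the excerpt. The helper name `adamw_grid` and the dictionary keys are hypothetical and are not taken from the authors' code.

```python
from itertools import product

# Values copied from the Experiment Setup row; nothing else is from the paper.
learning_rates = [0.016, 0.008, 0.004, 0.002, 0.001]
beta1_values = [0.9875, 0.975, 0.95, 0.9, 0.8]
beta2_values = [0.996875, 0.99375, 0.9875, 0.975, 0.95, 0.9, 0.8, 0.6]


def adamw_grid():
    """Enumerate every (lr, beta1, beta2) combination.

    Assumption: the sweep is a full Cartesian product of the three lists.
    This helper is illustrative only.
    """
    return [
        {"lr": lr, "beta1": b1, "beta2": b2}
        for lr, b1, b2 in product(learning_rates, beta1_values, beta2_values)
    ]


grid = adamw_grid()
print(len(grid))  # 5 * 5 * 8 = 200 configurations
```

Each dictionary could then be passed to a training script as one run configuration. How the authors actually launched their runs is not described in this summary.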