Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
How Memory in Optimization Algorithms Implicitly Modifies the Loss
Authors: Matias Cattaneo, Boris Shigida
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations confirm our theoretical findings. ... As a preliminary empirical illustration, we train ResNet-50 [31] on CIFAR-10 [38] using Adam (with decoupled weight decay) and Lion. ... We plot in Figure 2(a) the test accuracy at a fixed small training loss threshold (controlling for training speed). ... We observe in Figure 2(b) the same trends as above (in large-batch training, higher β2 increases the best validation perplexity, that is, hurts generalization; sometimes, taking lower β2 can close the gap between Adam and Lion). |
| Researcher Affiliation | Academia | Matias D. Cattaneo Princeton University EMAIL Boris Shigida Princeton University EMAIL |
| Pseudocode | No | The paper describes various optimization algorithms (e.g., Heavy-ball momentum gradient descent, AdamW, Lion-K) using mathematical equations and textual descriptions of their iterations. However, it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor are the steps presented in a structured, code-like format. |
| Open Source Code | Yes | The code is available at https://github.com/borshigida/how-memory-modifies-loss. |
| Open Datasets | Yes | As a preliminary empirical illustration, we train ResNet-50 [31] on CIFAR-10 [38] using Adam (with decoupled weight decay) and Lion. ... We also observe this phenomenon on a language task by training Transformer-XL [18] on WikiText-2 [49]. ... CIFAR-10 is released without an explicit license. MNIST has the CC BY-SA 3.0 license. ... The WikiText-2 dataset is released under the CC BY-SA 3.0 license. |
| Dataset Splits | Yes | As a preliminary empirical illustration, we train ResNet-50 [31] on CIFAR-10 [38]... We plot in Figure 2(a) the test accuracy at a fixed small training loss threshold... (b) Minimal validation perplexity (before overfitting) of Transformer-XL trained with full-batch Adam on WikiText-2... Our implementation of ResNet-50 follows the one from [11] (small modification of the standard torchvision implementation to allow for training on CIFAR-10 rather than ImageNet). |
| Hardware Specification | Yes | Each run took about 10 hours on average on one machine with a devoted 40 GB NVIDIA A100 GPU (though the training horizon was longer than necessary). This puts compute resources at around 12 * 10 * 3 = 360 A100-GPU-hours per sweep. |
| Software Dependencies | No | The paper mentions software like 'pytorch.optim' and 'torchvision' implementation, and codebase for 'transformer-xl', but does not provide specific version numbers for any of these components, which is required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | Full-batch Adam, learning rate h = 10^-3.5, β1 = 0.99, ε = 10^-6, weight decay 0.005. For comparison, we also show Lion with the same learning rate and weight decay (with default ρ1 = 0.9, ρ2 = 0.99). (b) Minimal validation perplexity (before overfitting) of Transformer-XL trained with full-batch Adam on WikiText-2 with learning rate 10^-4, β1 = 0.9, ε = 10^-6. For comparison, we also show Lion (with default ρ1 = 0.9, ρ2 = 0.99). |
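
The hyperparameters and compute estimate quoted in the table can be written down as a small Python sketch, which may be useful for checking the arithmetic or for seeding a configuration file. This is not the authors' code. The names `SETUP_A`, `SETUP_B`, `LION_DEFAULTS`, `SWEEP_FACTORS`, and `gpu_hours` are hypothetical. Mapping the first quoted setup to Figure 2(a) is an inference from the "(b)" label on the second one. β2 is left out because the table gives no single value for it, and the table does not say what the multipliers 12 and 3 count (10 is the reported hours per run).

```python
from math import prod

# Hyperparameters as quoted in the "Experiment Setup" row.
# ASSUMPTION: the first setup is the one for Figure 2(a); the row only labels "(b)".
# beta2 is deliberately absent: the table reports no single value for it.
SETUP_A = {
    "optimizer": "Adam (decoupled weight decay), full batch",
    "learning_rate": 10 ** -3.5,  # about 3.16e-4
    "beta1": 0.99,
    "eps": 1e-6,
    "weight_decay": 0.005,
}

# Figure 2(b): Transformer-XL on WikiText-2. No weight decay is quoted for it.
SETUP_B = {
    "optimizer": "Adam, full batch",
    "learning_rate": 1e-4,
    "beta1": 0.9,
    "eps": 1e-6,
}

# Lion defaults quoted for the comparison runs.
LION_DEFAULTS = {"rho1": 0.9, "rho2": 0.99}

# Factors from the "Hardware Specification" row: 12 * 10 * 3.
# Only the 10 (hours per run) is explained in the quoted text.
SWEEP_FACTORS = (12, 10, 3)


def gpu_hours(factors):
    """Multiply the quoted factors to get A100-GPU-hours per sweep."""
    return prod(factors)


if __name__ == "__main__":
    print(f"setup A learning rate: {SETUP_A['learning_rate']:.3e}")
    print(f"GPU-hours per sweep:   {gpu_hours(SWEEP_FACTORS)}")
```

Keeping the values in plain dictionaries avoids tying the sketch to a particular framework version, which matters here because the paper does not list its software versions.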