Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Understanding Adam Requires Better Rotation Dependent Assumptions
Authors: Tianyue Zhang, Lucas Maes, Alan Milligan, Alexia Jolicoeur-Martineau, Ioannis Mitliagkas, Damien Scieur, Simon Lacoste-Julien, Charles Guille-Escuret
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper investigates Adam's sensitivity to rotations of the parameter space. We observe that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis in practice. Our experimental investigation reveals that Adam's performance when training transformers empirically degrades when the objective function undergoes random rotations (Figure 1). This result demonstrates that Adam's effectiveness crucially depends on the canonical basis... |
| Researcher Affiliation | Collaboration | 1 Mila, Quebec AI Institute; 2 Université de Montréal; 3 University of British Columbia; 4 Samsung SAIL Montreal; 5 Canada CIFAR AI Chair; 6 Archimedes Unit, Athena Research Center |
| Pseudocode | Yes | Algorithm 1: Empirical Gradient Bound Estimation for Adam; Algorithm 2: SGD Momentum Optimization Algorithm; Algorithm 3: AdamW Optimization Algorithm; Algorithm 4: AdamW Optimization Algorithm with Rotation; Algorithm 5: AdamW Optimization Algorithm with SVD Rotation; Algorithm 6: Muon Optimization Algorithm |
| Open Source Code | No | The code to reproduce our experiments will be made publicly available upon publication. |
| Open Datasets | Yes | Language modeling (GPT-2, Fig. 1a): 124M-parameter decoder-only Transformer [Radford et al., 2019] trained on OpenWebText [Gokaslan and Cohen, 2019]. Image classification (ViT, Fig. 1b): 22M-parameter Vision Transformer (ViT/S) [Dosovitskiy et al., 2021] evaluated on ImageNet-1K [Deng et al., 2009]. |
| Dataset Splits | Yes | Language modeling (GPT-2, Fig. 1a): 124M-parameter decoder-only Transformer [Radford et al., 2019] trained on OpenWebText [Gokaslan and Cohen, 2019]. Image classification (ViT, Fig. 1b): 22M-parameter Vision Transformer (ViT/S) [Dosovitskiy et al., 2021] evaluated on ImageNet-1K [Deng et al., 2009]. Image classification (ResNet, Fig. B): ResNet-50 [He et al., 2016] on ImageNet-1K, where SGD often outperforms Adam [Keskar and Socher, 2017, Wilson et al., 2017]. Figure 14: SimpleViT ImageNet training loss, validation loss and top-1 validation accuracy. Figure 15: Training loss, validation loss and top-1 validation accuracy, when training a ResNet-50 with Adam on ImageNet across different scopes of rotations. |
| Hardware Specification | Yes | All experiments were performed on four A100 80GB GPUs, leveraging mixed precision. |
| Software Dependencies | No | No specific software versions (e.g., Python, PyTorch, CUDA versions) are mentioned within the paper. |
| Experiment Setup | Yes | GPT2 (Transformer). We trained a GPT-2 model with 124M parameters on the OpenWebText dataset [Gokaslan and Cohen, 2019] using a configuration designed for efficient pretraining. The model architecture includes 12 layers, 12 attention heads, and a 768-dimensional embedding space, with no bias in LayerNorm or Linear layers. We employed the AdamW optimizer with a peak learning rate of 6 × 10⁻⁴, β1 = 0.9, β2 = 0.95, and a weight decay of 0.1, applying gradient clipping of 1.0. Training ran for 100,000 iterations (or 30,000 for some smaller ablations), with cosine learning rate decay starting after a 2,000-iteration warm-up, decaying to a minimum of 6 × 10⁻⁵. We used a sequence length of 1024 and micro batch size of 12 with gradient accumulation steps to simulate an effective batch size of 480 sequences. ViT (Vision Transformer). We trained a Vision Transformer (ViT) model on the ImageNet-1K dataset [Deng et al., 2009] using the SimpleViT architecture [Beyer et al., 2022]. ... The AdamW optimizer was employed with a learning rate of 0.001, β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸, and a weight decay of 0.1. We used a cosine learning rate schedule with 5 warm-up epochs. The training was conducted for 100 epochs with a batch size of 1024. ResNet-50 (CNN). We trained a ResNet-50 model [He et al., 2015] on the ImageNet-1K dataset [Deng et al., 2009] using the AdamW optimizer. The optimizer was configured with a learning rate of 0.001, β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸, and a weight decay of 0.0001. We employed a cosine learning rate schedule with 5 warm-up epochs. The training ran for 100 epochs with a batch size of 256. |
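
The GPT-2 schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the constants are taken from the row above, but the function names (`lr_at`, `accumulation_steps`) are hypothetical, and the linear shape of the warm-up is an assumption, since the paper excerpt only states its length. How the micro-batches are divided across the four GPUs is also not stated, so the sketch only computes the total number per optimizer step.

```python
import math

# Values quoted in the Experiment Setup row for GPT-2 (124M).
PEAK_LR = 6e-4          # peak learning rate
MIN_LR = 6e-5           # learning rate floor reached by the cosine decay
WARMUP_ITERS = 2_000    # warm-up length in iterations
TOTAL_ITERS = 100_000   # full training length in iterations
MICRO_BATCH = 12        # sequences per forward/backward pass
EFFECTIVE_BATCH = 480   # sequences per optimizer step


def lr_at(step: int) -> float:
    """Learning rate at a given iteration.

    Assumption: the warm-up is linear. After warm-up, the rate follows a
    half-cosine from PEAK_LR down to MIN_LR and then stays at MIN_LR.
    """
    if step < WARMUP_ITERS:
        return PEAK_LR * (step + 1) / WARMUP_ITERS
    if step >= TOTAL_ITERS:
        return MIN_LR
    progress = (step - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return MIN_LR + cosine * (PEAK_LR - MIN_LR)


def accumulation_steps(effective_batch: int, micro_batch: int) -> int:
    """Total micro-batches accumulated per optimizer step (over all devices)."""
    if effective_batch % micro_batch != 0:
        raise ValueError("effective batch must be a multiple of the micro batch")
    return effective_batch // micro_batch


if __name__ == "__main__":
    for step in (0, WARMUP_ITERS, 51_000, TOTAL_ITERS):
        print(f"step {step:>7}: lr = {lr_at(step):.3e}")
    print("micro-batches per optimizer step:",
          accumulation_steps(EFFECTIVE_BATCH, MICRO_BATCH))
```

With these numbers, an effective batch of 480 sequences at a micro batch size of 12 corresponds to 40 micro-batches per optimizer step in total.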