Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
Dimension-adapted Momentum Outscales SGD
Authors: Damien Ferbach, Katie Everett, Gauthier Gidel, Elliot Paquette, Courtney Paquette
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on high-dimensional synthetic quadratics validate our theoretical predictions and large-scale text experiments with LSTMs show DANA's improved loss exponents over SGD hold in a practical setting. |
| Researcher Affiliation | Collaboration | Damien Ferbach Mila & Université de Montréal EMAIL Katie Everett Google DeepMind & MIT EMAIL Gauthier Gidel Mila & Université de Montréal EMAIL Elliot Paquette McGill University EMAIL Courtney Paquette Google DeepMind & McGill University EMAIL |
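| Pseudocode | Yes | $y_t = (1 - \Delta(t))\, y_{t-1} + \gamma_1(t; d) \sum_{i=1}^{B} \nabla R(\theta_t; x^i_{t+1})$, $\theta_{t+1} = \theta_t - \gamma_2(t; d) \sum_{i=1}^{B} \nabla R(\theta_t; x^i_{t+1}) - \gamma_3(t; d)\, y_t$ (Gen-Mom-SGD) |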
| Open Source Code | No | We intend to release the code for numerically computing the ODEs. |
| Open Datasets | Yes | Finally, we train LSTMs on text data (Fig. 2) showing the DANA loss exponents (Fig. 2c & 22) vary smoothly over κ3 and recover the divergent, outscaling, and SGD-like regimes predicted theoretically by Fig. 6. L.4 Experiment Details. We use the C4 dataset processed similarly to [53] and [84]. The dataset is tokenized using SentencePiece [57] with a vocabulary size of 32k. We train with a sequence length of 1024. For evaluation, we use a held-out validation set of 1000 examples. |
| Dataset Splits | Yes | For evaluation, we use a held-out validation set of 1000 examples. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It mentions 'compute resources provided by Mila (mila.quebec)' but without further specifics. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. While JAX [18] and SentencePiece [57] are cited, their specific versions used in the experiments are not mentioned. For JAX, only the publication year (2018) of the framework's description is given, not a software version number. |
| Experiment Setup | Yes | We use a learning rate of 1e-4 for SGD and DANA, cosine decay with 10% warmup, weight decay of 0.01, and a batch size of 256. We train with a sequence length of 1024. Remark B.3 (Good hyperparameters for DANA-decaying, with B = 1). In Section I, we provide some heuristics as to the correct sufficient conditions for stability of DANA-decay. More precisely for any 1 κ3 > $\frac{1}{2\alpha}$ (for DANA-decaying, one should pick $\kappa_3 = \frac{1}{2\alpha}$) the conditions are stated as 2 max 2α + 2β 1 , $\gamma_1 = 1$, $\gamma_2(d) = \frac{c_2}{\mathrm{Tr}(D)}$, and $\gamma_3(t; d) = \frac{c\,\gamma_2(d)}{(1 + t)^{\kappa_3}}$. |
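
The Gen-Mom-SGD update quoted in the Pseudocode row, combined with the DANA-decaying schedules quoted in the Experiment Setup row, can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' code (which the table records as unreleased). The function names, the constants `c2`, `c` and `delta`, and the momentum-decay form `delta / (1 + t)` are assumptions made for this example. The toy problem uses exact gradients of a diagonal quadratic instead of sampled batches.

```python
def gen_mom_sgd_step(theta, y, grads, t, delta_fn, gamma1_fn, gamma2_fn, gamma3_fn):
    """One Gen-Mom-SGD step on plain Python lists.

    grads holds one gradient vector per example in the batch.
    The momentum buffer y is updated first, and the new buffer is the one
    used in the parameter update, matching the order of the quoted rule.
    """
    d = len(theta)
    # Sum of per-example gradients over the batch.
    g = [sum(grad[j] for grad in grads) for j in range(d)]
    y_new = [(1.0 - delta_fn(t)) * y[j] + gamma1_fn(t) * g[j] for j in range(d)]
    theta_new = [
        theta[j] - gamma2_fn(t) * g[j] - gamma3_fn(t) * y_new[j] for j in range(d)
    ]
    return theta_new, y_new


def dana_decaying_schedules(trace_d, kappa3, c2, c, delta):
    """Build the four schedules as functions of the step t.

    gamma1 = 1, gamma2 = c2 / Tr(D), gamma3 = c * gamma2 / (1 + t)**kappa3.
    The momentum decay delta / (1 + t) is an assumption of this sketch.
    """
    gamma2 = c2 / trace_d
    return (
        lambda t: delta / (1.0 + t),                 # assumed momentum decay
        lambda t: 1.0,                               # gamma1
        lambda t: gamma2,                            # gamma2
        lambda t: c * gamma2 / (1.0 + t) ** kappa3,  # gamma3
    )


if __name__ == "__main__":
    # Hypothetical toy problem: R(theta) = 0.5 * sum_j lam_j * theta_j**2,
    # with lam_j = j**(-2 * alpha) and alpha = 1, so kappa3 = 1 / (2 * alpha).
    alpha = 1.0
    lam = [j ** (-2.0 * alpha) for j in (1, 2, 3)]
    schedules = dana_decaying_schedules(
        trace_d=sum(lam), kappa3=1.0 / (2.0 * alpha), c2=0.5, c=0.1, delta=4.0
    )
    theta, y = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
    for t in range(50):
        exact_grad = [lam[j] * theta[j] for j in range(3)]  # batch of size 1
        theta, y = gen_mom_sgd_step(theta, y, [exact_grad], t, *schedules)
    print(0.5 * sum(lam[j] * theta[j] ** 2 for j in range(3)))
```

Setting the third learning rate to zero removes the momentum term from the parameter update, so the step reduces to plain SGD with step size `gamma2`. That makes it a convenient sanity check when experimenting with other values of `kappa3`.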