Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Dimension-adapted Momentum Outscales SGD

Authors: Damien Ferbach, Katie Everett, Gauthier Gidel, Elliot Paquette, Courtney Paquette

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on high-dimensional synthetic quadratics validate our theoretical predictions and large-scale text experiments with LSTMs show DANA's improved loss exponents over SGD hold in a practical setting.
Researcher Affiliation Collaboration Damien Ferbach (Mila & Université de Montréal); Katie Everett (Google DeepMind & MIT); Gauthier Gidel (Mila & Université de Montréal); Elliot Paquette (McGill University); Courtney Paquette (Google DeepMind & McGill University)
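Pseudocode Yes $y_t = (1 - \Delta(t))\, y_{t-1} + \gamma_1(t; d) \sum_{i=1}^{B} \nabla R(\theta_t; x^i_{t+1})$, $\theta_{t+1} = \theta_t - \gamma_2(t; d) \sum_{i=1}^{B} \nabla R(\theta_t; x^i_{t+1}) - \gamma_3(t; d)\, y_t$. (Gen-Mom-SGD)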
Open Source Code No We intend to release the code for numerically computing the ODEs.
Open Datasets Yes Finally, we train LSTMs on text data (Fig. 2) showing the DANA loss exponents (Fig. 2c & 22) vary smoothly over κ3 and recover the divergent, outscaling, and SGD-like regimes predicted theoretically by Fig. 6. L.4 Experiment Details. We use the C4 dataset processed similarly to [53] and [84]. The dataset is tokenized using SentencePiece [57] with a vocabulary size of 32k. We train with a sequence length of 1024. For evaluation, we use a held-out validation set of 1000 examples.
Dataset Splits Yes For evaluation, we use a held-out validation set of 1000 examples.
Hardware Specification No The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It mentions 'compute resources provided by Mila (mila.quebec)' but without further specifics.
Software Dependencies No The paper does not provide specific ancillary software details with version numbers. While JAX [18] and SentencePiece [57] are cited, their specific versions used in the experiments are not mentioned. For JAX, only the publication year (2018) of the framework's description is given, not a software version number.
Experiment Setup Yes We use a learning rate of 1e-4 for SGD and DANA, cosine decay with 10% warmup, weight decay of 0.01, and a batch size of 256. We train with a sequence length of 1024. Remark B.3 (Good hyperparameters for DANA-decaying, with B = 1). In Section I, we provide some heuristics as to the correct sufficient conditions for stability of DANA-decay. More precisely, for any $1 \ge \kappa_3 > \frac{1}{2\alpha}$ (for DANA-decaying, one should pick $\kappa_3 = \frac{1}{2\alpha}$) the conditions are stated as 2 max 2α + 2β 1 , $\gamma_1 = 1$, $\gamma_2(d) = \frac{c_2}{\mathrm{Tr}(D)}$, and $\gamma_3(t; d) = \frac{c\, \gamma_2(d)}{(1 + t)^{\kappa_3}}$.
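
The Gen-Mom-SGD update quoted in the Pseudocode row, combined with the DANA-decaying hyperparameters quoted from Remark B.3, can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' code (which the scorecard notes has not been released). Several pieces are assumptions made only for this example: the momentum decay factor is taken to be Delta(t) = delta / (delta + t), the constants c2, c, and delta are untuned placeholders, and the helper names dana_decaying_schedules, gen_mom_sgd_step, and run_demo are hypothetical. The toy problem is a least-squares quadratic whose input coordinates have variances j^(-2*alpha).

```python
import random


def dana_decaying_schedules(trace_d, alpha, c2=0.25, c=0.1, delta=4.0):
    """Build the step-dependent coefficients used by Gen-Mom-SGD.

    gamma1, gamma2, gamma3 follow the forms quoted from Remark B.3.
    ASSUMPTIONS (not from the source): Delta(t) = delta / (delta + t),
    and the default values of c2, c, delta are untuned placeholders.
    """
    kappa3 = 1.0 / (2.0 * alpha)   # DANA-decaying choice kappa_3 = 1/(2 alpha)
    gamma2 = c2 / trace_d          # gamma_2(d) = c_2 / Tr(D)
    return {
        "Delta": lambda t: delta / (delta + t),
        "gamma1": lambda t: 1.0,
        "gamma2": lambda t: gamma2,
        "gamma3": lambda t: c * gamma2 / (1.0 + t) ** kappa3,
    }


def gen_mom_sgd_step(theta, y, grad_sum, t, sched):
    """Apply one Gen-Mom-SGD step.

    grad_sum is the stochastic gradient summed over the batch at theta.
    Returns the updated (theta, y) as new lists.
    """
    # Momentum buffer: y_t = (1 - Delta(t)) y_{t-1} + gamma_1(t) * grad_sum
    y_new = [(1.0 - sched["Delta"](t)) * yi + sched["gamma1"](t) * gi
             for yi, gi in zip(y, grad_sum)]
    # Parameters: theta_{t+1} = theta_t - gamma_2(t) grad_sum - gamma_3(t) y_t
    theta_new = [th - sched["gamma2"](t) * gi - sched["gamma3"](t) * yi
                 for th, gi, yi in zip(theta, grad_sum, y_new)]
    return theta_new, y_new


def run_demo(d=20, alpha=1.0, steps=200, seed=0):
    """Run batch-size-1 Gen-Mom-SGD on a toy quadratic; return the final loss."""
    rng = random.Random(seed)
    variances = [j ** (-2.0 * alpha) for j in range(1, d + 1)]  # diagonal of D
    sched = dana_decaying_schedules(sum(variances), alpha)
    target = [1.0] * d                      # hypothetical ground-truth weights
    theta, y = [0.0] * d, [0.0] * d
    for t in range(steps):
        x = [rng.gauss(0.0, v ** 0.5) for v in variances]
        residual = sum(xi * (th - ti) for xi, th, ti in zip(x, theta, target))
        grad = [residual * xi for xi in x]  # gradient of 0.5 * residual**2
        theta, y = gen_mom_sgd_step(theta, y, grad, t, sched)
    # Population loss: 0.5 * sum_j D_j (theta_j - target_j)^2
    return 0.5 * sum(v * (th - ti) ** 2
                     for v, th, ti in zip(variances, theta, target))
```

Setting gamma3 to zero in the schedule dictionary reduces the parameter update to plain SGD with step size gamma2, which gives a simple baseline for comparison. Calling run_demo() returns the population loss after the final step; because the constants are untuned placeholders, the result should not be read as reproducing any figure from the paper.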