Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling

Authors: Moritz Haas, Sebastian Bordt, Ulrike V. Luxburg, Leena Chennuru Vankadara

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments across optimizers, architectures, and data modalities, we validate that neural networks operate in this controlled divergence regime under CE loss but not under MSE loss. Our empirical evidence suggests that width-scaling considerations are surprisingly useful for predicting empirically maximal stable learning rate exponents which provide useful guidance on optimal learning rate exponents.
Researcher Affiliation | Academia | (1) University of Tübingen, Tübingen AI Center; (2) Gatsby Computational Neuroscience Unit, University College London
Pseudocode | No | The paper defines mathematical equations for MLP architecture and training but does not include any explicitly labeled pseudocode or algorithm blocks. The 'Experiment Code' mentioned under the title is a reference to external code, not pseudocode within the document.
Open Source Code | Yes | Open-source code to reproduce our experiments is publicly available. Our implementation is easy to adapt and publicly available at https://github.com/tml-tuebingen/torch-module-monitor.
Open Datasets | Yes | We train MLPs of varying depth up to width 16384 with plain SGD and Adam on CIFAR-10, MNIST and a generated multi-index model (reported in Appendix F). We implement our MLP experiments on MNIST (Deng, 2012) and CIFAR-10 (Krizhevsky et al., 2009) in PyTorch (Paszke et al., 2019). We train small Transformer models (Vaswani et al., 2017) using LitGPT (Lightning AI, 2023). We train all models on the same number of tokens to prevent confounding effects from increased training time. DCLM-Baseline dataset (Li et al., 2024).
Dataset Splits | Yes | The training set consists of 10^3 training points. We also draw a test set consisting of 10^4 test points. All figures in this section show training loss on the left and validation loss on the right.
Hardware Specification | Yes | Single training runs of 8-layer MLPs of width 16384 including tracking all relevant statistics as well as our 1.4B GPT model of width 4096 run within less than 24 hours on a single Nvidia A100 GPU. We typically trained MLPs up to width 4096 on a single Nvidia GeForce RTX 2080 Ti within less than 24 hours.
Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al., 2019)' and 'LitGPT (Lightning AI, 2023)', but it does not specify the version numbers for these software components. The instruction requires specific version numbers for key software components.
Experiment Setup | Yes | We use Adam with the PyTorch standard hyperparameters. By standard initialization we mean He initialization variance c_ϕ/fan_in with c_ϕ = 2 for the ReLU activation function (He et al., 2015). We adapt the Pythia (Biderman et al., 2023) architecture with 6 Transformer blocks, standard d_head^(-1/2) attention scaling, pre-attention and qk-Layernorm (Wortsman et al., 2024). We purely scale width, proportionally scaling the number of attention heads and the MLP hidden size while keeping the number of layers and head dimension d_head = 32 fixed. For widths 256, 1024 and 4096, this results in 8, 32 and 128 heads per Transformer block and a total of 30M, 167M and 1.4B parameters. Standard training means AdamW with a single, tuned maximal learning rate, (β1, β2) = (0.9, 0.95), ε = 10^-12, sequence length 512, batch size 256, 700 steps of warmup followed by cosine learning rate decay to 10% of the maximal learning rate, weight decay 0.1, gradient clipping.
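
The quantities quoted in the Experiment Setup row can be sanity-checked with a short Python sketch. It is an illustration only, not the authors' implementation: the function names are hypothetical, the warmup is assumed to be linear (the setup only says "700 steps of warmup"), and the total number of training steps is not stated, so it is left as a caller-supplied parameter.

```python
import math

D_HEAD = 32            # head dimension held fixed across widths (from the setup)
WARMUP_STEPS = 700     # warmup length (from the setup)
FINAL_FRACTION = 0.1   # cosine decay ends at 10% of the maximal learning rate
C_PHI = 2.0            # He-initialization constant for ReLU (from the setup)


def num_heads(width: int, d_head: int = D_HEAD) -> int:
    """Number of attention heads when width grows and d_head stays fixed."""
    if width % d_head != 0:
        raise ValueError("width must be a multiple of d_head")
    return width // d_head


def he_init_std(fan_in: int, c_phi: float = C_PHI) -> float:
    """Standard deviation for He initialization: variance = c_phi / fan_in."""
    return math.sqrt(c_phi / fan_in)


def learning_rate(step: int, max_lr: float, total_steps: int,
                  warmup_steps: int = WARMUP_STEPS,
                  final_fraction: float = FINAL_FRACTION) -> float:
    """Warmup followed by cosine decay to final_fraction * max_lr.

    Assumption: warmup is linear; total_steps is hypothetical because the
    setup does not report it.
    """
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = final_fraction * max_lr
    return min_lr + (max_lr - min_lr) * cosine


if __name__ == "__main__":
    for width in (256, 1024, 4096):
        print(width, num_heads(width), he_init_std(width))
```

With d_head = 32, `num_heads` gives 8, 32 and 128 heads for widths 256, 1024 and 4096, matching the figures in the row above. In a PyTorch training loop, a function like `learning_rate` could be wrapped in a `LambdaLR` scheduler, but that wiring is left out here to keep the sketch free of dependencies.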