Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Accelerating neural network training: An analysis of the AlgoPerf competition
Authors: Priya Kasimbeg, Frank Schneider, Runa Eschenhagen, Juhan Bae, Chandramouli Shama Sastry, Mark Saroufim, Boyuan Feng, Less Wright, Edward Yang, Zachary Nado, Sourabh Medapati, Philipp Hennig, Michael Rabbat, George Dahl
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper presents the inaugural ALGOPERF competition's results, which drew 18 diverse submissions from 10 teams. Our investigation reveals several key findings: (1) The winning submission in the external tuning ruleset, using DISTRIBUTED SHAMPOO, demonstrates the effectiveness of non-diagonal preconditioning over popular methods like ADAM, even when compared on wall-clock runtime. (2) The winning submission in the self-tuning ruleset, based on the SCHEDULE FREE ADAMW algorithm, demonstrates a new level of effectiveness for completely hyperparameter-free training algorithms. (3) The top-scoring submissions were surprisingly robust to workload changes. We also discuss the engineering challenges encountered in ensuring a fair comparison between different training algorithms. |
| Researcher Affiliation | Collaboration | 1 Google DeepMind, 2 University of Tübingen, 3 University of Cambridge, 4 Vector Institute, 5 University of Toronto, 6 Dalhousie University, 7 Meta |
| Pseudocode | No | The paper discusses various algorithms (e.g., DISTRIBUTED SHAMPOO, SCHEDULE FREE ADAMW, ADAM) and their performance in a competition. It describes their characteristics and results but does not provide any structured pseudocode or algorithm blocks for these methods within the text. |
| Open Source Code | No | The paper mentions links to the competition rules and documentation (github.com/mlcommons/algorithmic-efficiency/[...]/COMPETITION_RULES.md and DOCUMENTATION.md) and a specific pull request for a JAX LSTM layer implementation (https://github.com/google/jax/pull/13319). However, it does not contain an explicit statement or a direct link to the source code for the specific methodology or analysis presented in this paper. |
| Open Datasets | Yes | Workloads. The benchmark features multiple neural network training tasks, called workloads, each consisting of a dataset, model, loss function, target metric, validation target and runtime budget. The benchmark includes two types of workloads: fixed base workloads (Table 2) directly affect the benchmark score, and held-out workload variants (Table 3)... Table 2: Summary of fixed base workloads in the ALGOPERF benchmark. Losses include cross-entropy (CE), mean absolute error (L1), and Connectionist Temporal Classification loss (CTC). Additional evaluation metrics are structural similarity index measure (SSIM), (word) error rate (ER & WER), mean average precision (mAP), and bilingual evaluation understudy score (BLEU). Note: Some workloads have minor changes (see Appendix A.2) to the runtime budgets and validation targets compared to the ALGOPERF benchmark publication (Dahl et al., 2023). The runtime budget is that of the external tuning ruleset; the self-tuning ruleset allows 3× longer training. Workloads (task: dataset, model, loss, metric, target, budget): clickthrough rate prediction (CRITEO 1TB, DLRMSMALL, CE, CE, 0.123735, 7703); MRI reconstruction (FASTMRI, U-NET, L1, SSIM, 0.7344, 8859); image classification (IMAGENET, RESNET-50, CE, ER, 0.22569, 63,008; and VIT, CE, ER, 0.22691, 77,520); speech recognition (LIBRISPEECH, CONFORMER, CTC, WER, 0.085884, 61,068; and DEEPSPEECH, CTC, WER, 0.119936, 55,506); molecular property prediction (OGBG, GNN, CE, mAP, 0.28098, 18,477); translation (WMT, TRANSFORMER, CE, BLEU, 30.8491, 48,151). |
| Dataset Splits | Yes | Workloads. The benchmark features multiple neural network training tasks, called workloads, each consisting of a dataset, model, loss function, target metric, validation target and runtime budget. The submissions' objective is to train these workloads as quickly as possible; if the target is not reached within the runtime budget, the run receives an infinite score. |
| Hardware Specification | Yes | On average, scoring an external-tuning submission required 3469 hours and a self-tuning submission 1847 hours, totaling 49,240 hours on the competition hardware (8× NVIDIA V100 GPUs) (see Appendix A.3). |
| Software Dependencies | Yes | Building the training harness and software for the ALGOPERF: TRAINING ALGORITHMS competition to enable fair and meaningful comparisons between training algorithms, especially across the deep learning frameworks JAX (Bradbury et al., 2018) and PYTORCH (Paszke et al., 2019), required substantial engineering effort. ... PYTORCH 2.0 (Ansel et al., 2024), released in Dec 2022, introduced two major extensions that represent a major departure from PYTORCH's (Paszke et al., 2019) original eager programming model... |
| Experiment Setup | Yes | The goal of the ALGOPERF: TRAINING ALGORITHMS competition is to evaluate practical speed-ups in neural network training achieved solely by improving the underlying training algorithms. In the external tuning ruleset, submissions must provide workload-agnostic hyperparameter search spaces, while in the self-tuning ruleset they must be completely hyperparameter-free. In both rulesets, submissions are compared on time-to-result across multiple deep learning workloads, training on fixed hardware. ... The external tuning ruleset simulates hyperparameter tuning with limited parallel resources. In this ruleset, hyperparameters are tuned using five independent trials, with hyperparameter configurations sampled via quasirandom search (Bousquet et al., 2017) from the submission's defined search space, and are scored based on the runtime of the trial that achieves the validation target the fastest. ... The self-tuning ruleset simulates fully automated hyperparameter tuning during training on a single machine. This includes submissions that use the same hyperparameters across all workloads (e.g. ADAMW with defaults for all hyperparameters including regularization) or those that perform inner-loop tuning during the training run. ... A submission's benchmark score is based on its individual workload scores relative to those of other submissions, aggregated using performance profiles (Dolan & Moré, 2002; Dahl et al., 2023, Sec. 4.5). |
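The scoring scheme described above (per-workload time-to-target, aggregated via Dolan & Moré performance profiles) can be illustrated with a minimal sketch. This is a hypothetical illustration, not the official AlgoPerf scoring code; the submission names and runtimes are invented, and `np.inf` marks a run that missed the validation target within the budget.

```python
import numpy as np

# Hypothetical per-workload runtimes (seconds) for two submissions;
# np.inf means the validation target was not reached within the budget.
runtimes = {
    "submission_A": np.array([100.0, 250.0, np.inf, 80.0]),
    "submission_B": np.array([120.0, 200.0, 300.0, 90.0]),
}

def performance_profiles(runtimes, taus):
    """For each submission, the fraction of workloads on which its runtime
    is within a factor tau of the fastest submission on that workload
    (Dolan & More, 2002)."""
    mat = np.vstack(list(runtimes.values()))  # submissions x workloads
    best = mat.min(axis=0)                    # fastest runtime per workload
    ratios = mat / best                       # performance ratios (inf stays inf)
    return {
        name: np.array([(row <= tau).mean() for tau in taus])
        for name, row in zip(runtimes, ratios)
    }

taus = np.array([1.0, 1.5, 2.0, 4.0])
profiles = performance_profiles(runtimes, taus)

# One way to summarize a profile into a single benchmark score is its
# average height over the evaluated tau range (larger = faster on more
# workloads); the exact aggregation in the paper may differ.
scores = {name: p.mean() for name, p in profiles.items()}
```

Note how a single missed target (the `np.inf` entry) permanently caps submission A's profile below 1.0, which is exactly how the infinite-score rule penalizes unreliable training algorithms.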