Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Accelerating neural network training: An analysis of the AlgoPerf competition
Authors: Priya Kasimbeg, Frank Schneider, Runa Eschenhagen, Juhan Bae, Chandramouli Shama Sastry, Mark Saroufim, Boyuan Feng, Less Wright, Edward Yang, Zachary Nado, Sourabh Medapati, Philipp Hennig, Michael Rabbat, George Dahl
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper presents the inaugural ALGOPERF competition's results, which drew 18 diverse submissions from 10 teams. Our investigation reveals several key findings: (1) The winning submission in the external tuning ruleset, using DISTRIBUTED SHAMPOO, demonstrates the effectiveness of non-diagonal preconditioning over popular methods like ADAM, even when compared on wall-clock runtime. (2) The winning submission in the self-tuning ruleset, based on the SCHEDULE FREE ADAMW algorithm, demonstrates a new level of effectiveness for completely hyperparameter-free training algorithms. (3) The top-scoring submissions were surprisingly robust to workload changes. We also discuss the engineering challenges encountered in ensuring a fair comparison between different training algorithms. |
| Researcher Affiliation | Collaboration | 1 Google DeepMind, 2 University of Tübingen, 3 University of Cambridge, 4 Vector Institute, 5 University of Toronto, 6 Dalhousie University, 7 Meta |
| Pseudocode | No | The paper discusses various algorithms (e.g., DISTRIBUTED SHAMPOO, SCHEDULE FREE ADAMW, ADAM) and their performance in a competition. It describes their characteristics and results but does not provide any structured pseudocode or algorithm blocks for these methods within the text. |
| Open Source Code | No | The paper mentions links to the competition rules and documentation (github.com/mlcommons/algorithmic-efficiency/[...]/COMPETITION_RULES.md and DOCUMENTATION.md) and a specific pull request for a JAX LSTM layer implementation (https://github.com/google/jax/pull/13319). However, it does not contain an explicit statement or a direct link to the source code for the specific methodology or analysis presented in this paper. |
| Open Datasets | Yes | Workloads. The benchmark features multiple neural network training tasks, called workloads, each consisting of a dataset, model, loss function, target metric, validation target and runtime budget. The benchmark includes two types of workloads: fixed base workloads (Table 2) directly affect the benchmark score, and held-out workload variants (Table 3)... Table 2: Summary of fixed base workloads in the ALGOPERF benchmark. Losses include cross-entropy (CE), mean absolute error (L1), and Connectionist Temporal Classification loss (CTC). Additional evaluation metrics are structural similarity index measure (SSIM), (word) error rate (ER & WER), mean average precision (mAP), and bilingual evaluation understudy score (BLEU). Note: Some workloads have minor changes (see Appendix A.2) to the runtime budgets and validation targets compared to the ALGOPERF benchmark publication (Dahl et al., 2023). The runtime budget is that of the external tuning ruleset; the self-tuning ruleset allows 3× longer training. Workloads (task: dataset, model, loss, metric, target, budget): clickthrough rate prediction (CRITEO 1TB, DLRMSMALL, CE, CE, 0.123735, 7703); MRI reconstruction (FASTMRI, U-NET, L1, SSIM, 0.7344, 8859); image classification (IMAGENET, RESNET-50, CE, ER, 0.22569, 63,008; and VIT, CE, ER, 0.22691, 77,520); speech recognition (LIBRISPEECH, CONFORMER, CTC, WER, 0.085884, 61,068; and DEEPSPEECH, CTC, WER, 0.119936, 55,506); molecular property prediction (OGBG, GNN, CE, mAP, 0.28098, 18,477); translation (WMT, TRANSFORMER, CE, BLEU, 30.8491, 48,151). |
| Dataset Splits | Yes | Workloads. The benchmark features multiple neural network training tasks, called workloads, each consisting of a dataset, model, loss function, target metric, validation target and runtime budget. The submissions' objective is to train these workloads as quickly as possible; if the target is not reached within the runtime budget, the run receives an infinite score. |
| Hardware Specification | Yes | On average, scoring an external-tuning submission required 3469 hours and a self-tuning submission 1847 hours, totaling 49,240 hours on the competition hardware (8× NVIDIA V100 GPUs) (see Appendix A.3). |
| Software Dependencies | Yes | Building the training harness and software for the ALGOPERF: TRAINING ALGORITHMS competition to enable fair and meaningful comparisons between training algorithms, especially across the deep learning frameworks JAX (Bradbury et al., 2018) and PYTORCH (Paszke et al., 2019), required substantial engineering effort. ... PYTORCH 2.0 (Ansel et al., 2024), released in Dec 2022, introduced two major extensions that represent a major departure from PYTORCH's (Paszke et al., 2019) original eager programming model... |
| Experiment Setup | Yes | The goal of the ALGOPERF: TRAINING ALGORITHMS competition is to evaluate practical speed-ups in neural network training achieved solely by improving the underlying training algorithms. In the external tuning ruleset, submissions must provide workload-agnostic hyperparameter search spaces, while in the self-tuning ruleset they must be completely hyperparameter-free. In both rulesets, submissions are compared on time-to-result across multiple deep learning workloads, training on fixed hardware. ... The external tuning ruleset simulates hyperparameter tuning with limited parallel resources. In this ruleset, hyperparameters are tuned using five independent trials, with hyperparameter configurations sampled via quasirandom search (Bousquet et al., 2017) from the submission's defined search space, and are scored based on the runtime of the trial that achieves the validation target the fastest. ... The self-tuning ruleset simulates fully automated hyperparameter tuning during training on a single machine. This includes submissions that use the same hyperparameters across all workloads (e.g. ADAMW with defaults for all hyperparameters including regularization) or those that perform inner-loop tuning during the training run. ... A submission's benchmark score is based on its individual workload scores relative to those of other submissions, aggregated using performance profiles (Dolan & Moré, 2002; Dahl et al., 2023, Sec. 4.5). |
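The scoring scheme described above (per-workload time-to-target, aggregated via Dolan & Moré performance profiles) can be illustrated with a minimal sketch. This is a hypothetical illustration, not the official AlgoPerf scoring code; the submission names and runtimes are invented, and `np.inf` marks a run that missed the validation target within the budget.

```python
import numpy as np

# Hypothetical per-workload runtimes (seconds) for two submissions;
# np.inf means the validation target was not reached within the budget.
runtimes = {
    "submission_A": np.array([100.0, 250.0, np.inf, 80.0]),
    "submission_B": np.array([120.0, 200.0, 300.0, 90.0]),
}

def performance_profiles(runtimes, taus):
    """For each submission, the fraction of workloads on which its runtime
    is within a factor tau of the fastest submission on that workload
    (Dolan & More, 2002)."""
    mat = np.vstack(list(runtimes.values()))  # submissions x workloads
    best = mat.min(axis=0)                    # fastest runtime per workload
    ratios = mat / best                       # performance ratios (inf stays inf)
    return {
        name: np.array([(row <= tau).mean() for tau in taus])
        for name, row in zip(runtimes, ratios)
    }

taus = np.array([1.0, 1.5, 2.0, 4.0])
profiles = performance_profiles(runtimes, taus)

# One way to summarize a profile into a single benchmark score is its
# average height over the evaluated tau range (larger = faster on more
# workloads); the exact aggregation in the paper may differ.
scores = {name: p.mean() for name, p in profiles.items()}
```

Note how a single missed target (the `np.inf` entry) permanently caps submission A's profile below 1.0, which is exactly how the infinite-score rule penalizes unreliable training algorithms.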