Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Nesterov Method for Asynchronous Pipeline Parallel Optimization
Authors: Thalaiyasingam Ajanthan, Sameera Ramasinghe, Yan Zuo, Gil Avraham, Alexander Long
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the merits of our approach on large-scale language modelling tasks... Our experiments clearly demonstrate the feasibility of asynchronous PP optimization in the large-scale setting. 5. Experiments: We evaluate our method on the language modelling task using decoder-only architectures. We use three large-scale datasets: WikiText (WT) (Merity et al., 2016), BookCorpus (BC) (Zhu et al., 2015), and OpenWebText (OWT) (Gokaslan et al., 2019) datasets. ... Ablation Study |
| Researcher Affiliation | Industry | ¹Pluralis Research. Correspondence to: Thalaiyasingam Ajanthan <EMAIL>. |
| Pseudocode | No | The paper includes mathematical equations for the Nesterov method but does not present any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/PluralisResearch/AsyncPP. |
| Open Datasets | Yes | We use three large-scale datasets: WikiText (WT) (Merity et al., 2016), BookCorpus (BC) (Zhu et al., 2015), and OpenWebText (OWT) (Gokaslan et al., 2019) datasets. |
| Dataset Splits | Yes | For WikiText, we utilize the predefined training and validation splits, for the other datasets, we randomly select 10% of the training set as the held-out validation set. |
| Hardware Specification | Yes | All experiments are performed on a system equipped with 8 A10G GPUs. ... These experiments are performed on a system equipped with 8 A100 GPUs. ... Each worker node is assigned an NVIDIA L4 GPU. |
| Software Dependencies | Yes | In the PyTorch implementation of NAdam (PyTorch Contributors, 2025)... Nadam optimizer pytorch 2.5.0 documentation. https://pytorch.org/docs/stable/generated/torch.optim.NAdam.html, 2025. Accessed: 2025-01-16. |
| Experiment Setup | Yes | Across all experiments, we maintain a microbatch size of 8, a learning rate η of 3e-4, and a weight decay of 0.01, unless otherwise specified. ... Each experiment is run for 50k iterations, with a linear warmup of 3k iterations starting from a learning rate of 1e-7. Then, it is decayed to 3e-5 following a cosine decay schedule. ... Our proposed method is denoted as Ours, which employs the Nadam optimizer (Dozat, 2016) with decoupled weight decay and a momentum coefficient β1 of 0.99. |
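
The learning-rate schedule quoted in the Experiment Setup row can be sketched as a small function. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and constant names are hypothetical, the warmup is assumed to rise linearly from 1e-7 to the peak rate of 3e-4 over the first 3k iterations, and the cosine decay is assumed to span the remaining 47k iterations and end at 3e-5.

```python
import math

# Values quoted in the Experiment Setup row.
PEAK_LR = 3e-4          # learning rate eta
WARMUP_START_LR = 1e-7  # starting learning rate of the linear warmup
FINAL_LR = 3e-5         # learning rate reached after cosine decay
WARMUP_ITERS = 3_000
TOTAL_ITERS = 50_000


def learning_rate(step: int) -> float:
    """Return the learning rate at a given iteration (hypothetical helper).

    Assumption: warmup ends exactly at PEAK_LR, and the cosine decay
    covers iterations WARMUP_ITERS..TOTAL_ITERS.
    """
    if step < WARMUP_ITERS:
        # Linear interpolation from WARMUP_START_LR up to PEAK_LR.
        frac = step / WARMUP_ITERS
        return WARMUP_START_LR + frac * (PEAK_LR - WARMUP_START_LR)
    # Cosine decay from PEAK_LR down to FINAL_LR; clamp past the last step.
    progress = min(1.0, (step - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + cosine * (PEAK_LR - FINAL_LR)


if __name__ == "__main__":
    for step in (0, 1_500, 3_000, 26_500, 50_000):
        print(f"iteration {step:>6}: lr = {learning_rate(step):.3e}")
```

In PyTorch, the quoted optimizer settings would roughly correspond to `torch.optim.NAdam` with `lr=3e-4`, `weight_decay=0.01`, `decoupled_weight_decay=True`, and a first beta of 0.99. The second beta is not given in the quoted text, so any value chosen for it would be an assumption.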