Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
From Promise to Practice: Realizing High-performance Decentralized Training
Authors: Zesen Wang, Jiaojiao Zhang, Xuyang Wu, Mikael Johansson
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We deploy our solution in clusters with up to 64 GPUs, demonstrating its practical advantages in both runtime and generalization performance under a fixed iteration budget. ... Extensive experiments validate the feasibility and practical benefits of decentralized training. |
| Researcher Affiliation | Academia | KTH Royal Institute of Technology, Southern University of Science and Technology EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Decentralized Adam on worker i ... Algorithm 2 All-Reduce Adam on worker i ... Algorithm 3 Decentralized Training on worker i ... Algorithm 4 Decentralized Accumulated Adam on worker i |
| Open Source Code | Yes | The experiment code is open-source at https://github.com/WangZesen/Decentralized-Training-Exp, and the extension code is open-source at https://github.com/WangZesen/Decent-DP. ... The code bases for the PyTorch extension and the experiments are open-source at https://github.com/WangZesen/Decent-DP and https://github.com/WangZesen/Decentralized-Training-Exp, respectively. |
| Open Datasets | Yes | Following Vaswani et al. (2017), we trained the transformer (65M parameters for the base variant and 213M parameters for the big variant) on the English-German and English-French WMT14 dataset (Bojar et al., 2014). ... In the image classification task, we trained ResNet-50 (He et al., 2016) on ImageNet-1K (Deng et al., 2009). ... For the GPT-2 pretraining task, we trained GPT-2 (small) (Radford et al., 2019) on OpenWebText (Gokaslan et al., 2019). |
| Dataset Splits | Yes | In each run, we evaluate the trained model by BLEU (Papineni et al., 2002) and METEOR (Banerjee & Lavie, 2005) on the test set. ... In each run, we evaluate the trained model by Top-1 and Top-5 accuracies on the validation set. ... and we report the training and validation losses of the trained models. |
| Hardware Specification | Yes | Table 1: Communication time of performing averaging operations for 8196 iterations over three 25MB FP32 tensors (the selection of 25MB is based on the default bucket size used in PyTorch DDP (Li et al., 2020)). Setup 1: 16 A100 GPUs on 4 nodes interconnected by 100Gbps InfiniBand. Setup 2: 16 A40 GPUs on 4 nodes interconnected by 25Gbps Ethernet. ... nodes with 8 T4 GPUs with 100Gbps InfiniBand connection ... Hardware configuration: 16 CPU cores and 64 GB RAM per GPU. For nodes with 4 A100 each, the nodes are interconnected with 100Gbps InfiniBand. For nodes with 4 A40 each, the nodes are interconnected with 25Gbps Ethernet. ... Intel(R) Xeon(R) Gold 6338 CPU @ 2GHz ... NVIDIA Tesla A100 HGX GPU with 40GB VRAM ... NVIDIA Tesla A40 |
| Software Dependencies | Yes | All of our experiment implementation is based on PyTorch (Li et al., 2020). ... Key software: PyTorch 2.5.1, CUDA 12.1, ffcv 1.0.2. |
| Experiment Setup | Yes | We consider three large-scale tasks: neural machine translation, image classification, and GPT-2 pretraining. ... The global batch size is fixed to around 25k source tokens and 25k target tokens. The models were trained for 130,000 steps (or 24 epochs) in total. ... All experiments use 0.1 as the dropout rate and 0.1 as the label smoothing. The learning rate schedule is the inverse square root with 1 epoch of linear warmup. ... the learning rate is 0.0007, and the betas for the Adam optimizer are (0.9, 0.98) ... the betas for the Adam optimizer are set to (0.974, 0.999) ... The base learning rates are 0.0020, 0.0028, and 0.0038 for 8, 16, and 32 workers, respectively. ... In all experiments, the global batch size is fixed to 1024, the learning rate schedule is cosine annealing with 5 epochs of linear warmup, the weight decay is 3e-5 on non-batch-normalization parameters, and the label smoothing is 0.1. |
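The two learning-rate schedules quoted in the Experiment Setup row can be sketched as below. This is a minimal illustration, not the authors' implementation: the warmup length in steps and the exact post-warmup scaling are assumptions (the report only states "inverse square root with 1 epoch of linear warmup" with a base rate of 0.0007, and "cosine annealing with 5 epochs of linear warmup"); the linked repositories are authoritative.

```python
import math

def inv_sqrt_lr(step, base_lr=0.0007, warmup_steps=1000):
    """Inverse-square-root schedule with linear warmup (translation runs).

    Assumption: warmup is expressed in steps and the LR peaks at base_lr
    at the end of warmup, then decays as 1/sqrt(step).
    """
    if step < warmup_steps:
        # Linear ramp from ~0 up to base_lr over the warmup window.
        return base_lr * (step + 1) / warmup_steps
    # Decay proportional to 1/sqrt(step), continuous at the warmup boundary.
    return base_lr * math.sqrt(warmup_steps / (step + 1))

def cosine_lr(step, total_steps, base_lr, warmup_steps):
    """Cosine annealing with linear warmup (image-classification runs)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Progress through the annealing phase in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

With `warmup_steps=1000`, `inv_sqrt_lr` reaches 0.0007 at the end of warmup and halves by step 4000; `cosine_lr` anneals from its base rate to zero over the remaining steps.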