Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
From Promise to Practice: Realizing High-performance Decentralized Training
Authors: Zesen Wang, Jiaojiao Zhang, Xuyang Wu, Mikael Johansson
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We deploy our solution in clusters with up to 64 GPUs, demonstrating its practical advantages in both runtime and generalization performance under a fixed iteration budget. ... Extensive experiments validate the feasibility and practical benefits of decentralized training. |
| Researcher Affiliation | Academia | KTH Royal Institute of Technology, Southern University of Science and Technology EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Decentralized Adam on worker i ... Algorithm 2 All-Reduce Adam on worker i ... Algorithm 3 Decentralized Training on worker i ... Algorithm 4 Decentralized Accumulated Adam on worker i |
| Open Source Code | Yes | The experiment code is open-source at https://github.com/WangZesen/Decentralized-Training-Exp, and the extension code is open-source at https://github.com/WangZesen/Decent-DP. ... The code bases for the PyTorch extension and the experiments are open-source at https://github.com/WangZesen/Decent-DP and https://github.com/WangZesen/Decentralized-Training-Exp, respectively. |
| Open Datasets | Yes | Following Vaswani et al. (2017), we trained the transformer (65M parameters for the base variant and 213M parameters for the big variant) on the English-German and English-French WMT14 dataset (Bojar et al., 2014). ... In the image classification task, we trained ResNet-50 (He et al., 2016) on ImageNet-1K (Deng et al., 2009). ... For the GPT-2 pretraining task, we trained GPT-2 (small) (Radford et al., 2019) on OpenWebText (Gokaslan et al., 2019). |
| Dataset Splits | Yes | In each run, we evaluate the trained model by BLEU (Papineni et al., 2002) and METEOR (Banerjee & Lavie, 2005) on the test set. ... In each run, we evaluate the trained model by Top-1 and Top-5 accuracies on the validation set. ... and we report the training and validation losses of the trained models. |
| Hardware Specification | Yes | Table 1: Communication time of performing averaging operations for 8196 iterations over three 25MB FP32 tensors (the selection of 25MB is based on the default bucket size used in PyTorch DDP (Li et al., 2020)). Setup 1: 16 A100 GPUs on 4 nodes interconnected by 100Gbps InfiniBand. Setup 2: 16 A40 GPUs on 4 nodes interconnected by 25Gbps Ethernet. ... nodes with 8 T4 GPUs with 100Gbps InfiniBand connection ... Hardware configuration: 16 CPU cores and 64 GB RAM per GPU. For nodes with 4 A100 each, the nodes are interconnected with 100Gbps InfiniBand. For nodes with 4 A40 each, the nodes are interconnected with 25Gbps Ethernet. ... Intel(R) Xeon(R) Gold 6338 CPU @ 2GHz ... NVIDIA Tesla A100 HGX GPU with 40GB VRAM ... NVIDIA Tesla A40 |
| Software Dependencies | Yes | All of our experiment implementation is based on PyTorch (Li et al., 2020). ... Key software: PyTorch 2.5.1, CUDA 12.1, ffcv 1.0.2. |
| Experiment Setup | Yes | We consider three large-scale tasks: neural machine translation, image classification, and GPT-2 pretraining. ... The global batch size is fixed to around 25k source tokens and 25k target tokens. The models were trained for 130,000 steps (or 24 epochs) in total. ... All experiments use 0.1 as the dropout rate and 0.1 as the label smoothing. The learning rate schedule is the inverse square root with 1 epoch of linear warmup. ... the learning rate is 0.0007, and the betas for the Adam optimizer are (0.9, 0.98) ... the betas for the Adam optimizer are set to (0.974, 0.999) ... The base learning rates are 0.0020, 0.0028, and 0.0038 for 8, 16, and 32 workers, respectively. ... In all experiments, the global batch size is fixed to 1024, the learning rate schedule is cosine annealing with 5 epochs of linear warmup, the weight decay is 3e-5 on non-batch-normalization parameters, and the label smoothing is 0.1. |
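The two learning-rate schedules quoted in the Experiment Setup row can be sketched as below. This is a minimal illustration, not the authors' implementation: the warmup length in steps and the exact post-warmup scaling are assumptions (the report only states "inverse square root with 1 epoch of linear warmup" with a base rate of 0.0007, and "cosine annealing with 5 epochs of linear warmup"); the linked repositories are authoritative.

```python
import math

def inv_sqrt_lr(step, base_lr=0.0007, warmup_steps=1000):
    """Inverse-square-root schedule with linear warmup (translation runs).

    Assumption: warmup is expressed in steps and the LR peaks at base_lr
    at the end of warmup, then decays as 1/sqrt(step).
    """
    if step < warmup_steps:
        # Linear ramp from ~0 up to base_lr over the warmup window.
        return base_lr * (step + 1) / warmup_steps
    # Decay proportional to 1/sqrt(step), continuous at the warmup boundary.
    return base_lr * math.sqrt(warmup_steps / (step + 1))

def cosine_lr(step, total_steps, base_lr, warmup_steps):
    """Cosine annealing with linear warmup (image-classification runs)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Progress through the annealing phase in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

With `warmup_steps=1000`, `inv_sqrt_lr` reaches 0.0007 at the end of warmup and halves by step 4000; `cosine_lr` anneals from its base rate to zero over the remaining steps.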