Asynchronous Decentralized Parallel Stochastic Gradient Descent
Authors: Xiangru Lian, Wei Zhang, Ce Zhang, Ji Liu
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, AD-PSGD outperforms the best of decentralized parallel SGD (D-PSGD), asynchronous parallel SGD (A-PSGD), and standard data parallel SGD (AllReduce-SGD), often by orders of magnitude in a heterogeneous environment. When training ResNet-50 on ImageNet with up to 128 GPUs, AD-PSGD converges (w.r.t. epochs) similarly to AllReduce-SGD, but each epoch can be up to 4-8X faster than its synchronous counterparts in a network-sharing HPC environment. (Section 5: Experiments) |
| Researcher Affiliation | Collaboration | Xiangru Lian (1*), Wei Zhang (2*), Ce Zhang (3), Ji Liu (4). 1: Department of Computer Science, University of Rochester; 2: IBM T. J. Watson Research Center; 3: Department of Computer Science, ETH Zurich; 4: Tencent AI Lab, Seattle, USA |
| Pseudocode | Yes | Algorithm 1: AD-PSGD (logical view); a hedged sketch of this logical view appears after the table. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing the source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We use CIFAR10 and ImageNet-1K as the evaluation datasets and we use Torch-7 as our deep learning framework. |
| Dataset Splits | No | The paper mentions using the CIFAR10 and ImageNet-1K datasets, but it does not explicitly provide training/test/validation splits or percentages, nor does it refer to predefined splits with citations. |
| Hardware Specification | Yes | IBM S822LC HPC cluster: each node has 4 Nvidia P100 GPUs, 160 POWER8 cores (8-way SMT), and 500 GB memory; nodes are connected by a 100 Gbit/s Mellanox EDR InfiniBand network. We use 32 such nodes. x86-based cluster: a cloud-like environment with a 10 Gbit/s Ethernet connection; each node has 4 Nvidia P100 GPUs, 56 Xeon E5-2680 cores (2-way SMT), and 1 TB DRAM. We use 4 such nodes. |
| Software Dependencies | No | The paper mentions 'Torch-7 as our deep learning framework' and 'MPI to implement the communication scheme', but it does not specify version numbers for these software components. |
| Experiment Setup | Yes | Batch size: 128 per worker for VGG, 32 for ResNet-20. Learning rate: for VGG, start at 1 and halve every 25 epochs; for ResNet-20, start at 0.1 and decay by a factor of 10 at the 81st and 122nd epochs (see the schedule sketch below the table). Momentum: 0.9. Weight decay: 10^-4. |
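
As a reading aid for the pseudocode row, here is a minimal, serial Python sketch of the AD-PSGD logical view on a toy least-squares problem. Everything in it (NumPy, the synthetic data, the worker/step-size names, and the fully connected neighbor choice) is an illustrative assumption, not the paper's Torch-7/MPI implementation, and the serial loop ignores the gradient staleness that arises in a truly asynchronous run.

```python
import numpy as np

# Hedged sketch of the AD-PSGD logical view (Algorithm 1) on a toy problem.
rng = np.random.default_rng(0)
n_workers, dim, lr, steps = 8, 10, 0.05, 2000

# One synthetic data shard per worker (stand-in for a worker's local samples).
A = [rng.normal(size=(64, dim)) for _ in range(n_workers)]
b = [a @ rng.normal(size=dim) + 0.01 * rng.normal(size=64) for a in A]

# Each worker holds its own copy of the model (the "local model").
x = [np.zeros(dim) for _ in range(n_workers)]

def local_grad(i, w):
    """Mini-batch gradient of worker i's least-squares loss at point w."""
    idx = rng.choice(len(b[i]), size=16, replace=False)
    Ai, bi = A[i][idx], b[i][idx]
    return Ai.T @ (Ai @ w - bi) / len(idx)

for _ in range(steps):
    i = int(rng.integers(n_workers))                        # worker that fires this step
    g = local_grad(i, x[i])                                 # (1) gradient at the local model
    j = int((i + rng.integers(1, n_workers)) % n_workers)   # (2) pick a random neighbor
    avg = 0.5 * (x[i] + x[j])                               #     average the two local models
    x[i], x[j] = avg.copy(), avg.copy()
    x[i] -= lr * g                                          # (3) apply the gradient after averaging

# The local models should end up close to one another (approximate consensus).
print("max distance between local models:",
      max(np.linalg.norm(xi - x[0]) for xi in x))
```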
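
The learning-rate schedules from the experiment-setup row can be written down directly. This is a hedged sketch: the function names are illustrative, and the epoch indexing (1-based, with decay taking effect at the start of the 81st and 122nd epochs) is an assumption not spelled out in the table.

```python
def vgg_lr(epoch: int, base_lr: float = 1.0) -> float:
    """VGG schedule per the setup row: start at 1, halve every 25 epochs (1-based epoch assumed)."""
    return base_lr * 0.5 ** ((epoch - 1) // 25)

def resnet20_lr(epoch: int, base_lr: float = 0.1) -> float:
    """ResNet-20 schedule: start at 0.1, divide by 10 at the 81st and 122nd epochs."""
    lr = base_lr
    if epoch >= 81:
        lr /= 10.0
    if epoch >= 122:
        lr /= 10.0
    return lr

# Example: the ResNet-20 rate drops to 0.01 at epoch 81 and to 0.001 at epoch 122.
assert abs(resnet20_lr(81) - 0.01) < 1e-12
assert abs(resnet20_lr(122) - 0.001) < 1e-12
```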