Asynchronous Decentralized Parallel Stochastic Gradient Descent
Authors: Xiangru Lian, Wei Zhang, Ce Zhang, Ji Liu
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, AD-PSGD outperforms the best of decentralized parallel SGD (D-PSGD), asynchronous parallel SGD (A-PSGD), and standard data parallel SGD (AllReduce-SGD), often by orders of magnitude in a heterogeneous environment. When training ResNet-50 on ImageNet with up to 128 GPUs, AD-PSGD converges (w.r.t. epochs) similarly to AllReduce-SGD, but each epoch can be up to 4-8X faster than its synchronous counterparts in a network-sharing HPC environment. (Section 5: Experiments) |
| Researcher Affiliation | Collaboration | Xiangru Lian (1*), Wei Zhang (2*), Ce Zhang (3), Ji Liu (4). 1: Department of Computer Science, University of Rochester; 2: IBM T. J. Watson Research Center; 3: Department of Computer Science, ETH Zurich; 4: Tencent AI Lab, Seattle, USA |
| Pseudocode | Yes | Algorithm 1: AD-PSGD (logical view); a hedged sketch of this logical view appears after the table. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing the source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We use CIFAR10 and ImageNet-1K as the evaluation datasets and we use Torch-7 as our deep learning framework. |
| Dataset Splits | No | The paper mentions using the CIFAR10 and ImageNet-1K datasets, but it does not explicitly provide training/test/validation splits or percentages, nor does it refer to predefined splits with citations. |
| Hardware Specification | Yes | IBM S822LC HPC cluster: each node has 4 Nvidia P100 GPUs, 160 POWER8 cores (8-way SMT), and 500 GB memory; nodes are connected by a 100 Gbit/s Mellanox EDR InfiniBand network. We use 32 such nodes. x86-based cluster: a cloud-like environment with a 10 Gbit/s Ethernet connection; each node has 4 Nvidia P100 GPUs, 56 Xeon E5-2680 cores (2-way SMT), and 1 TB DRAM. We use 4 such nodes. |
| Software Dependencies | No | The paper mentions 'Torch-7 as our deep learning framework' and 'MPI to implement the communication scheme', but it does not specify version numbers for these software components. |
| Experiment Setup | Yes | Batch size: 128 per worker for VGG, 32 for ResNet-20. Learning rate: for VGG, start at 1 and halve every 25 epochs; for ResNet-20, start at 0.1 and decay by a factor of 10 at the 81st and 122nd epochs (see the schedule sketch below the table). Momentum: 0.9. Weight decay: 10^-4. |
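
As a reading aid for the pseudocode row, here is a minimal, serial Python sketch of the AD-PSGD logical view on a toy least-squares problem. Everything in it (NumPy, the synthetic data, the worker/step-size names, and the fully connected neighbor choice) is an illustrative assumption, not the paper's Torch-7/MPI implementation, and the serial loop ignores the gradient staleness that arises in a truly asynchronous run.

```python
import numpy as np

# Hedged sketch of the AD-PSGD logical view (Algorithm 1) on a toy problem.
rng = np.random.default_rng(0)
n_workers, dim, lr, steps = 8, 10, 0.05, 2000

# One synthetic data shard per worker (stand-in for a worker's local samples).
A = [rng.normal(size=(64, dim)) for _ in range(n_workers)]
b = [a @ rng.normal(size=dim) + 0.01 * rng.normal(size=64) for a in A]

# Each worker holds its own copy of the model (the "local model").
x = [np.zeros(dim) for _ in range(n_workers)]

def local_grad(i, w):
    """Mini-batch gradient of worker i's least-squares loss at point w."""
    idx = rng.choice(len(b[i]), size=16, replace=False)
    Ai, bi = A[i][idx], b[i][idx]
    return Ai.T @ (Ai @ w - bi) / len(idx)

for _ in range(steps):
    i = int(rng.integers(n_workers))                        # worker that fires this step
    g = local_grad(i, x[i])                                 # (1) gradient at the local model
    j = int((i + rng.integers(1, n_workers)) % n_workers)   # (2) pick a random neighbor
    avg = 0.5 * (x[i] + x[j])                               #     average the two local models
    x[i], x[j] = avg.copy(), avg.copy()
    x[i] -= lr * g                                          # (3) apply the gradient after averaging

# The local models should end up close to one another (approximate consensus).
print("max distance between local models:",
      max(np.linalg.norm(xi - x[0]) for xi in x))
```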
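
The learning-rate schedules from the experiment-setup row can be written down directly. This is a hedged sketch: the function names are illustrative, and the epoch indexing (1-based, with decay taking effect at the start of the 81st and 122nd epochs) is an assumption not spelled out in the table.

```python
def vgg_lr(epoch: int, base_lr: float = 1.0) -> float:
    """VGG schedule per the setup row: start at 1, halve every 25 epochs (1-based epoch assumed)."""
    return base_lr * 0.5 ** ((epoch - 1) // 25)

def resnet20_lr(epoch: int, base_lr: float = 0.1) -> float:
    """ResNet-20 schedule: start at 0.1, divide by 10 at the 81st and 122nd epochs."""
    lr = base_lr
    if epoch >= 81:
        lr /= 10.0
    if epoch >= 122:
        lr /= 10.0
    return lr

# Example: the ResNet-20 rate drops to 0.01 at epoch 81 and to 0.001 at epoch 122.
assert abs(resnet20_lr(81) - 0.01) < 1e-12
assert abs(resnet20_lr(122) - 0.001) < 1e-12
```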