Exponential Graph is Provably Efficient for Decentralized Deep Training
Authors: Bicheng Ying, Kun Yuan, Yiming Chen, Hanbin Hu, Pan Pan, Wotao Yin
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive industry-level experiments across different tasks and models with various decentralized methods, graphs, and network sizes to validate our theoretical results. |
| Researcher Affiliation | Collaboration | Bicheng Ying1,3, Kun Yuan2, Yiming Chen2, Hanbin Hu4, Pan Pan2, Wotao Yin2; 1 University of California, Los Angeles; 2 DAMO Academy, Alibaba Group; 3 Google Inc.; 4 University of California, Santa Barbara. ybc@ucla.edu, {kun.yuan, charles.cym}@alibaba-inc.com, hanbinhu@ucsb.edu, {panpan.pp, wotao.yin}@alibaba-inc.com |
| Pseudocode | Yes | Algorithm 1 DmSGD |
| Open Source Code | Yes | Our code is implemented through BlueFog and available at https://github.com/Bluefog-Lib/NeurIPS2021-Exponential-Graph. |
| Open Datasets | Yes | We conduct a series of image classification experiments with ImageNet-1K [16], which consists of 1,281,167 training images and 50,000 validation images in 1000 classes. |
| Dataset Splits | Yes | We conduct a series of image classification experiments with ImageNet-1K [16], which consists of 1,281,167 training images and 50,000 validation images in 1000 classes. |
| Hardware Specification | Yes | Each server contains 8 V100 GPUs in our cluster and is treated as one node. |
| Software Dependencies | Yes | We implement all decentralized algorithms with PyTorch [46] 1.8.0 using NCCL 2.8.3 (CUDA 10.1) as the communication backend. For the implementation of decentralized methods, we utilize BlueFog [63]. |
| Experiment Setup | Yes | The training protocol in [21] is used. In detail, we train for a total of 90 epochs. The learning rate is warmed up over the first 5 epochs and is decayed by a factor of 10 at the 30th, 60th, and 80th epochs. The momentum SGD optimizer is used with linear learning rate scaling by default. Experiments are trained in mixed precision using the PyTorch native AMP module (see the sketch below the table). |
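
To make the quoted training protocol concrete, the following is a minimal sketch of the schedule it describes (5-epoch linear warmup, 10x decay at epochs 30/60/80, momentum SGD, native AMP), assuming PyTorch 1.8 or later. The model, synthetic data loader, and base learning rate are placeholders for illustration, not values taken from the paper, and the decentralized BlueFog communication is omitted.

```python
# Sketch of the quoted protocol: warmup + step decay, momentum SGD, native AMP.
# Placeholders: the model, data, and base_lr below are NOT the paper's values.
import torch
import torch.nn.functional as F

def lr_factor(epoch, warmup_epochs=5, milestones=(30, 60, 80)):
    """Linear warmup over the first 5 epochs, then 10x decay at epochs 30/60/80."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    return 0.1 ** sum(epoch >= m for m in milestones)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(32, 10).to(device)      # placeholder model
base_lr = 0.1                                    # placeholder; the paper scales it linearly with worker count
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # native AMP

# Synthetic stand-in for the ImageNet-1K data loader.
loader = [(torch.randn(16, 32, device=device),
           torch.randint(0, 10, (16,), device=device)) for _ in range(4)]

for epoch in range(90):                          # 90 epochs total, as quoted
    for x, y in loader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=(device == "cuda")):
            loss = F.cross_entropy(model(x), y)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()                             # advance the per-epoch LR schedule
```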