Decentralized Deep Learning with Arbitrary Communication Compression
Authors: Anastasia Koloskova*, Tao Lin*, Sebastian U. Stich, Martin Jaggi
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the practical performance of the algorithm in two key scenarios: the training of deep learning models (i) over decentralized user devices, connected by a peer-to-peer network and (ii) in a datacenter. |
| Researcher Affiliation | Academia | Anastasia Koloskova anastasia.koloskova@epfl.ch Tao Lin tao.lin@epfl.ch Sebastian U. Stich sebastian.stich@epfl.ch Martin Jaggi martin.jaggi@epfl.ch EPFL Lausanne, Switzerland |
| Pseudocode | Yes | Algorithm 1 CHOCO-SGD (Koloskova et al., 2019) ... Algorithm 2 CHOCO-SGD with Momentum (a minimal sketch of the CHOCO-SGD update appears below the table) |
| Open Source Code | Yes | Our implementations are open-source and available at https://github.com/epfml/ChocoSGD. |
| Open Datasets | Yes | Cifar10 dataset (50K/10K training/test samples) (Krizhevsky, 2012)... ImageNet-1k (1.28M/50K training/validation) (Deng et al., 2009)... WikiText-2 (600 training and 60 validation articles with a total of 2 088 628 and 217 646 tokens respectively) (Merity et al., 2016). |
| Dataset Splits | Yes | Cifar10 dataset (50K/10K training/test samples)... ImageNet-1k (1.28M/50K training/validation)... WikiText-2 (600 training and 60 validation articles with a total of 2 088 628 and 217 646 tokens respectively) |
| Hardware Specification | Yes | We perform our experiments on 8 machines (n1-standard-32 from Google Cloud with the Intel Ivy Bridge CPU platform), where each machine has 4 Tesla P100 GPUs and the machines are interconnected via 10 Gbps Ethernet. |
| Software Dependencies | No | The paper mentions common deep learning software such as PyTorch and standard model architectures such as ResNet, but does not provide specific version numbers for any key software components or dependencies required for reproducibility. |
| Experiment Setup | Yes | For all algorithms we fine-tune the initial learning rate and gradually warm it up from a relatively small value (0.1) (Goyal et al., 2017) for the first 5 epochs. The learning rate is decayed by 10 twice, at 150 and 225 epochs, and training stops at 300 epochs. For CHOCO-SGD and DeepSqueeze the consensus learning rate γ is also tuned. The detailed hyper-parameter tuning procedure is described in Appendix F. ... Table 4 reports the fine-tuned hyperparameters of CHOCO-SGD for training ResNet-20 on Cifar10, while Table 6 reports the fine-tuned hyperparameters of the baselines. Table 5 reports the fine-tuned hyperparameters of CHOCO-SGD for training ResNet-20/LSTM on a social network topology. (A sketch of this learning-rate schedule also appears below the table.) |
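
To make the pseudocode row concrete, here is a minimal single-process NumPy sketch in the spirit of Algorithm 1 (CHOCO-SGD): a local SGD step, compression of the difference to the publicly known estimate, exchange of the compressed message, and a gossip/consensus step. The toy quadratic objective, ring topology, top-k compression operator, step sizes, and helper names (`top_k`, `choco_sgd`) are illustrative assumptions for this sketch, not the authors' implementation or tuned settings; their actual PyTorch code is in the linked ChocoSGD repository.

```python
# Minimal single-process simulation of a CHOCO-SGD-style update on a toy
# quadratic objective. All constants below are illustrative assumptions.
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude coordinates of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def choco_sgd(A_list, b_list, W, steps=500, eta=0.05, gamma=0.5, k=2):
    """A_list/b_list: per-node least-squares data; W: row-stochastic mixing matrix;
    eta: SGD step size; gamma: consensus (gossip) learning rate; k: sparsification level."""
    n, d = len(A_list), A_list[0].shape[1]
    x = np.zeros((n, d))        # local models x_i
    x_hat = np.zeros((n, d))    # publicly known estimates \hat{x}_i
    for _ in range(steps):
        # local gradient step (full gradient here for simplicity, stochastic in the paper)
        grads = np.stack([A_list[i].T @ (A_list[i] @ x[i] - b_list[i]) for i in range(n)])
        x = x - eta * grads
        # compress the difference to the public estimate and "transmit" it
        q = np.stack([top_k(x[i] - x_hat[i], k) for i in range(n)])
        x_hat = x_hat + q
        # gossip/consensus step on the public estimates:
        # x_i += gamma * sum_j w_ij (x_hat_j - x_hat_i)
        x = x + gamma * (W @ x_hat - x_hat)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 4, 5
    A_list = [rng.standard_normal((10, d)) for _ in range(n)]
    b_list = [rng.standard_normal(10) for _ in range(n)]
    W = np.zeros((n, n))        # ring topology mixing matrix
    for i in range(n):
        W[i, i] = 0.5
        W[i, (i - 1) % n] = 0.25
        W[i, (i + 1) % n] = 0.25
    print(choco_sgd(A_list, b_list, W).mean(axis=0))
```

The key design point the sketch tries to show is that only the compressed difference `q` is ever communicated, while the consensus step operates on the shared estimates `x_hat` rather than on the exact local models.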
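The experiment-setup row describes a warm-up followed by a step decay. Below is a hedged sketch of that schedule as a plain Python function; `base_lr` is a placeholder for the fine-tuned initial learning rate, since the quote only fixes the warm-up start of 0.1, the 5-epoch warm-up, the decay epochs 150 and 225, and the 300-epoch budget.

```python
# Hedged sketch of the learning-rate schedule described in the setup row:
# linear warm-up from a small value over the first 5 epochs, then a step
# decay by a factor of 10 at epochs 150 and 225, training until epoch 300.
# base_lr is a placeholder, not the paper's tuned value.
def learning_rate(epoch, base_lr=1.0, warmup_start=0.1, warmup_epochs=5):
    if epoch < warmup_epochs:
        # linear warm-up from warmup_start to base_lr
        return warmup_start + (base_lr - warmup_start) * epoch / warmup_epochs
    if epoch < 150:
        return base_lr
    if epoch < 225:
        return base_lr / 10
    return base_lr / 100

# example: inspect the schedule around the warm-up and decay boundaries
for e in (0, 4, 5, 149, 150, 224, 225, 299):
    print(e, learning_rate(e))
```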