Decentralized Deep Learning with Arbitrary Communication Compression

Authors: Anastasia Koloskova*, Tao Lin*, Sebastian U. Stich, Martin Jaggi

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the practical performance of the algorithm in two key scenarios: the training of deep learning models (i) over decentralized user devices, connected by a peer-to-peer network, and (ii) in a datacenter.
Researcher Affiliation | Academia | Anastasia Koloskova anastasia.koloskova@epfl.ch, Tao Lin tao.lin@epfl.ch, Sebastian U. Stich sebastian.stich@epfl.ch, Martin Jaggi martin.jaggi@epfl.ch, EPFL, Lausanne, Switzerland
Pseudocode | Yes | Algorithm 1 CHOCO-SGD (Koloskova et al., 2019) ... Algorithm 2 CHOCO-SGD with Momentum (a hedged sketch of the CHOCO-SGD update appears after the table).
Open Source Code | Yes | Our implementations are open-source and available at https://github.com/epfml/ChocoSGD.
Open Datasets | Yes | Cifar10 dataset (50K/10K training/test samples) (Krizhevsky, 2012)... ImageNet-1k (1.28M/50K training/validation) (Deng et al., 2009)... WikiText-2 (600 training and 60 validation articles with a total of 2 088 628 and 217 646 tokens respectively) (Merity et al., 2016).
Dataset Splits | Yes | Cifar10 dataset (50K/10K training/test samples)... ImageNet-1k (1.28M/50K training/validation)... WikiText-2 (600 training and 60 validation articles with a total of 2 088 628 and 217 646 tokens respectively)
Hardware Specification | Yes | We perform our experiments on 8 machines (n1-standard-32 from Google Cloud with Intel Ivy Bridge CPU platform), where each of the machines has 4 Tesla P100 GPUs and the machines are interconnected via 10 Gbps Ethernet.
Software Dependencies | No | The paper mentions common software and libraries used in deep learning, such as PyTorch and ResNet, but does not provide specific version numbers for any key software components or dependencies required for reproducibility.
Experiment Setup | Yes | For all algorithms we fine-tune the initial learning rate and gradually warm it up from a relatively small value (0.1) (Goyal et al., 2017) for the first 5 epochs. The learning rate is decayed by 10 twice, at 150 and 225 epochs, and training stops at 300 epochs. For CHOCO-SGD and DeepSqueeze the consensus learning rate γ is also tuned. The detailed hyper-parameter tuning procedure is given in Appendix F. ... Table 4 lists the fine-tuned hyperparameters of CHOCO-SGD for training ResNet-20 on Cifar10, while Table 6 reports the fine-tuned hyperparameters of our baselines. Table 5 lists the fine-tuned hyperparameters of CHOCO-SGD for training ResNet-20/LSTM on a social network topology. (A hedged sketch of this learning-rate schedule also appears after the table.)
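
The pseudocode row above refers to Algorithm 1 (CHOCO-SGD). Below is a minimal single-process NumPy sketch of that update, written from the algorithm's description only: the ring mixing matrix, the top-k compressor, the toy quadratic objectives, and the step-size values are illustrative assumptions, not the authors' implementation (which lives in the linked repository).

# Minimal single-process simulation of the CHOCO-SGD update (sketch, not the reference code).
import numpy as np

n, d = 8, 50                 # number of workers, model dimension
eta, gamma = 0.05, 0.4       # SGD step size and consensus step size (both tuned in the paper)
T, k = 200, 5                # iterations and top-k sparsification budget (assumed values)

rng = np.random.default_rng(0)

# Toy heterogeneous local objectives f_i(x) = 0.5 * ||x - b_i||^2 (assumption for illustration).
b = rng.normal(size=(n, d))

def grad(i, x):
    """Stochastic gradient of worker i's local objective (noise mimics mini-batching)."""
    return (x - b[i]) + 0.1 * rng.normal(size=d)

def topk_compress(v, k):
    """Keep the k largest-magnitude coordinates of v, zero the rest (one admissible compressor)."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

# Symmetric doubly-stochastic mixing matrix W for a ring topology (assumption).
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 1 / 3
    W[i, (i - 1) % n] = 1 / 3
    W[i, (i + 1) % n] = 1 / 3

x = np.zeros((n, d))         # local models x_i
x_hat = np.zeros((n, d))     # publicly known estimates \hat{x}_i

for t in range(T):
    # 1) local SGD step on each worker
    for i in range(n):
        x[i] -= eta * grad(i, x[i])
    # 2) compress the difference between the local model and its public estimate
    q = np.stack([topk_compress(x[i] - x_hat[i], k) for i in range(n)])
    # 3) every worker updates the estimates of itself and its neighbours
    x_hat += q
    # 4) gossip (consensus) step on the estimates, scaled by the consensus step size gamma
    for i in range(n):
        x[i] += gamma * sum(W[i, j] * (x_hat[j] - x_hat[i]) for j in range(n) if W[i, j] > 0)

print("consensus distance:", np.linalg.norm(x - x.mean(axis=0)))

In a genuinely distributed implementation each worker stores only the estimates of its own neighbours and exchanges the compressed differences over the network; the single shared x_hat array above is just a convenience of this centralized simulation.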
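The experiment-setup row quotes a warm-up/decay learning-rate schedule. The sketch below expresses that schedule as a plain Python function; the linear warm-up shape and the base_lr value are assumptions (the fine-tuned values are reported in Tables 4-6 of the paper).

# Hedged sketch of the quoted schedule: warm-up from 0.1 over the first 5 epochs,
# 10x decay at epochs 150 and 225, training for 300 epochs in total.
def learning_rate(epoch: int, base_lr: float = 1.6, warmup_epochs: int = 5) -> float:
    if epoch < warmup_epochs:
        # gradually increase from 0.1 towards the tuned base learning rate (linear shape assumed)
        return 0.1 + (base_lr - 0.1) * (epoch + 1) / warmup_epochs
    if epoch < 150:
        return base_lr
    if epoch < 225:
        return base_lr / 10
    return base_lr / 100

# Example: schedule values at a few milestone epochs.
for e in [0, 4, 100, 160, 230, 299]:
    print(e, round(learning_rate(e, base_lr=1.6), 4))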