Distributed Learning over Unreliable Networks

Authors: Chen Yu, Hanlin Tang, Cédric Renggli, Simon Kassing, Ankit Singla, Dan Alistarh, Ce Zhang, Ji Liu

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The technical contribution of this paper is a novel theoretical analysis proving that distributed learning over unreliable networks can achieve a convergence rate comparable to that of centralized learning or distributed learning over reliable networks. Further, we prove that the influence of the packet drop rate diminishes as the number of parameter servers grows. We map this theoretical result onto a real-world scenario, training deep neural networks over an unreliable network layer, and conduct network simulations to validate the system improvement gained by allowing the network to be unreliable. (An illustrative sketch of this aggregation scheme appears after the table.)
Researcher Affiliation | Collaboration | 1 Department of Computer Science, University of Rochester, USA; 2 Department of Computer Science, ETH Zurich; 3 Institute of Science and Technology Austria; 4 Seattle AI Lab, FeDA Lab, Kwai Inc. Correspondence to: Chen Yu <cyu28@ur.rochester.edu>.
Pseudocode | Yes | Algorithm 1 (RPS)
Open Source Code | No | The paper states 'We implement the RPS algorithm using MPI' but does not provide any specific link or explicit statement about releasing the source code for their implementation.
Open Datasets | Yes | We train ResNet (He et al., 2016) with different numbers of layers on CIFAR-10 (Krizhevsky & Hinton, 2009) for classifying images. We perform the NLU task on the Air Travel Information System (ATIS) corpus with a one-layer LSTM network.
Dataset Splits | No | The paper mentions training loss convergence and validation trends but does not provide explicit details on how the datasets (CIFAR-10, ATIS) were split into training, validation, and test sets. While CIFAR-10 has a standard split, the paper does not confirm its usage or provide details for ATIS.
Hardware Specification | Yes | The training of the models is executed on 16 NVIDIA TITAN Xp GPUs. The workers are connected by Gigabit Ethernet. We use each GPU as a worker.
Software Dependencies | Yes | We simulate packet losses by adapting the latest version 2.5 of the Microsoft Cognitive Toolkit (Seide & Agarwal, 2016). We implement the RPS algorithm using MPI.
Experiment Setup | Yes | During training, we use a local batch size of 32 samples per worker for image classification. We adjust the learning rate by applying a linear scaling rule (Goyal et al., 2017) with a decay of 10 percent after epochs 80 and 120. To achieve the best possible convergence, we apply a gradual warmup strategy (Goyal et al., 2017) during the first 5 epochs. We deliberately do not use any regularization or momentum during the experiments in order to be consistent with the described algorithm and proof. (A sketch of this learning-rate schedule follows the table.)
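
The Research Type and Pseudocode rows above summarize the core RPS idea: the model is partitioned into blocks handled by multiple (logical) parameter servers, and gradient packets exchanged between workers and servers may be dropped. The snippet below is a minimal NumPy sketch of that aggregation step, not the authors' MPI implementation: the function name rps_style_aggregate, the even block partitioning, and the zero-update fallback for a block whose packets are all lost are illustrative assumptions and may differ from the paper's Algorithm 1.

```python
# Toy simulation (assumed, not the authors' code): gradient aggregation
# when each worker->server packet is dropped i.i.d. with probability
# drop_prob. The model is split into num_blocks blocks, one per logical
# parameter server; each server averages only the blocks it receives.
import numpy as np

def rps_style_aggregate(grads, num_blocks, drop_prob, rng):
    """grads: (num_workers, dim) per-worker gradients.
    Returns a gradient assembled from per-block averages of the
    packets that actually arrived."""
    num_workers, dim = grads.shape
    blocks = np.array_split(np.arange(dim), num_blocks)
    agg = np.zeros(dim)
    for block in blocks:
        # Each worker's packet for this block arrives independently
        # with probability 1 - drop_prob.
        arrived = rng.random(num_workers) > drop_prob
        if arrived.any():
            agg[block] = grads[arrived][:, block].mean(axis=0)
        # If every packet for a block is lost, this sketch simply leaves
        # a zero update for that block (a simplification).
    return agg

rng = np.random.default_rng(0)
num_workers, dim = 16, 1024
grads = rng.normal(size=(num_workers, dim))
exact = grads.mean(axis=0)
for p in (0.0, 0.01, 0.1):
    approx = rps_style_aggregate(grads, num_blocks=8, drop_prob=p, rng=rng)
    err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
    print(f"drop_prob={p:4.2f}  relative aggregation error={err:.3f}")
```

Because drops are independent of the gradient values, averaging whichever blocks arrive still centers on the true average; this is the intuition the paper's convergence analysis formalizes.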
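
The Experiment Setup row quotes a linear scaling rule, a 5-epoch gradual warmup, and decays after epochs 80 and 120. The following is a minimal sketch of such a schedule, assuming a base learning rate of 0.1 at a reference batch size of 256 (the defaults in Goyal et al., 2017, not stated in the excerpt) and interpreting the decay as multiplying the rate by 0.1 at each milestone.

```python
# Illustrative learning-rate schedule (assumed defaults, not the
# authors' code): linear scaling with the global batch size, gradual
# warmup over the first 5 epochs, and step decay after epochs 80/120.

def learning_rate(epoch, num_workers=16, local_batch=32,
                  base_lr=0.1, ref_batch=256, warmup_epochs=5,
                  milestones=(80, 120), decay_factor=0.1):
    # Linear scaling rule: scale the base rate with the global batch size.
    target_lr = base_lr * (num_workers * local_batch) / ref_batch
    # Gradual warmup: ramp linearly up to the target rate.
    if epoch < warmup_epochs:
        return target_lr * (epoch + 1) / warmup_epochs
    # Step decay at each milestone epoch.
    lr = target_lr
    for m in milestones:
        if epoch >= m:
            lr *= decay_factor
    return lr

for e in (0, 4, 5, 79, 80, 119, 120):
    print(f"epoch {e:3d}: lr = {learning_rate(e):.5f}")
```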