Distributed Learning over Unreliable Networks
Authors: Chen Yu, Hanlin Tang, Cedric Renggli, Simon Kassing, Ankit Singla, Dan Alistarh, Ce Zhang, Ji Liu
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The technical contribution of this paper is a novel theoretical analysis proving that distributed learning over an unreliable network can achieve a convergence rate comparable to centralized learning or to distributed learning over reliable networks. Further, we prove that the influence of the packet drop rate diminishes as the number of parameter servers grows. We map this theoretical result onto a real-world scenario, training deep neural networks over an unreliable network layer, and conduct network simulations to validate the system improvement gained by allowing the network to be unreliable. |
| Researcher Affiliation | Collaboration | 1 Department of Computer Science, University of Rochester, USA; 2 Department of Computer Science, ETH Zurich; 3 Institute of Science and Technology Austria; 4 Seattle AI Lab, FeDA Lab, Kwai Inc. Correspondence to: Chen Yu <cyu28@ur.rochester.edu>. |
| Pseudocode | Yes | Algorithm 1 RPS (a hedged sketch of this scheme appears below the table). |
| Open Source Code | No | The paper states 'We implement the RPS algorithm using MPI' but does not provide any specific link or explicit statement about releasing the source code for their implementation. |
| Open Datasets | Yes | We train ResNet (He et al., 2016) with different numbers of layers on CIFAR-10 (Krizhevsky & Hinton, 2009) for classifying images. We perform the NLU task on the Air Travel Information System (ATIS) corpus using a one-layer LSTM network. |
| Dataset Splits | No | The paper mentions training loss convergence and validation trends but does not provide explicit details on how the datasets (CIFAR-10, ATIS) were split into training, validation, and test sets. While CIFAR-10 has a standard split, the paper does not confirm its usage or provide details for ATIS. |
| Hardware Specification | Yes | The training of the models is executed on 16 NVIDIA TITAN Xp GPUs. The workers are connected by Gigabit Ethernet. We use each GPU as a worker. |
| Software Dependencies | Yes | We simulate packet losses by adapting the latest version 2.5 of the Microsoft Cognitive Toolkit (Seide & Agarwal, 2016). We implement the RPS algorithm using MPI. |
| Experiment Setup | Yes | During training, we use a local batch size of 32 samples per worker for image classification. We adjust the learning rate by applying a linear scaling rule (Goyal et al., 2017) and decay it to 10 percent of its value after 80 and 120 epochs, respectively. To achieve the best possible convergence, we apply a gradual warmup strategy (Goyal et al., 2017) during the first 5 epochs. We deliberately do not use any regularization or momentum during the experiments in order to be consistent with the described algorithm and proof. (A hedged reconstruction of this learning-rate schedule appears below the table.) |
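
The Pseudocode row points to Algorithm 1 (RPS). The following is a minimal single-process sketch, not the authors' MPI/CNTK implementation: it simulates one aggregation step in which each worker's gradient packet is dropped independently with probability `drop_rate`, and it assumes that dropped packets are simply excluded from the average. The function and parameter names are hypothetical.

```python
import numpy as np

def simulate_unreliable_aggregation(params, worker_grads, drop_rate, lr, rng):
    """Simulate one SGD step where each worker's gradient packet reaches
    the (simulated) parameter server only with probability 1 - drop_rate.
    Dropped packets are left out of the average; this aggregation rule is
    an assumption for illustration, not the paper's exact RPS algorithm."""
    received = [g for g in worker_grads if rng.random() > drop_rate]
    if not received:
        # No packet arrived this round; keep the parameters unchanged.
        return params
    avg_grad = np.mean(received, axis=0)
    return params - lr * avg_grad

# Example: 16 simulated workers (as in the hardware row) with a 10% drop rate.
rng = np.random.default_rng(0)
params = np.zeros(4)
grads = [rng.standard_normal(4) for _ in range(16)]
params = simulate_unreliable_aggregation(params, grads, drop_rate=0.1, lr=0.1, rng=rng)
```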
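The Experiment Setup row describes the learning-rate schedule only in prose. Below is a hedged reconstruction of it, assuming a base learning rate of 0.1 per 256 samples (a common choice under the linear scaling rule of Goyal et al., 2017, but not stated in the table); only the warmup length, decay epochs, worker count, and local batch size come from the rows above.

```python
def learning_rate(epoch, num_workers=16, local_batch=32,
                  base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Epoch-indexed learning rate under the schedule sketched above.
    base_lr and base_batch are assumed values, not reported in the table."""
    # Linear scaling rule: scale the base rate with the global batch size.
    target = base_lr * (num_workers * local_batch) / base_batch
    if epoch < warmup_epochs:
        # Gradual warmup: ramp linearly up to the target rate over 5 epochs.
        return target * (epoch + 1) / warmup_epochs
    if epoch >= 120:
        return target * 0.01   # decayed twice (after epochs 80 and 120)
    if epoch >= 80:
        return target * 0.1    # first decay after epoch 80
    return target
```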