Distributed Learning over Unreliable Networks

Authors: Chen Yu, Hanlin Tang, Cédric Renggli, Simon Kassing, Ankit Singla, Dan Alistarh, Ce Zhang, Ji Liu

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The technical contribution of this paper is a novel theoretical analysis proving that distributed learning over unreliable networks can achieve a convergence rate comparable to that of centralized learning or distributed learning over reliable networks. Further, we prove that the influence of the packet drop rate diminishes as the number of parameter servers grows. We map this theoretical result onto a real-world scenario, training deep neural networks over an unreliable network layer, and conduct network simulations to validate the system improvement gained by allowing the network to be unreliable. (An illustrative sketch of this aggregation scheme appears after the table.)
Researcher Affiliation | Collaboration | 1 Department of Computer Science, University of Rochester, USA; 2 Department of Computer Science, ETH Zurich; 3 Institute of Science and Technology Austria; 4 Seattle AI Lab, FeDA Lab, Kwai Inc. Correspondence to: Chen Yu <cyu28@ur.rochester.edu>.
Pseudocode | Yes | Algorithm 1 (RPS)
Open Source Code | No | The paper states 'We implement the RPS algorithm using MPI' but does not provide any specific link or explicit statement about releasing the source code for their implementation.
Open Datasets | Yes | We train ResNet (He et al., 2016) with different numbers of layers on CIFAR-10 (Krizhevsky & Hinton, 2009) for classifying images. We perform the NLU task on the Air Travel Information System (ATIS) corpus with a one-layer LSTM network.
Dataset Splits | No | The paper mentions training loss convergence and validation trends but does not provide explicit details on how the datasets (CIFAR-10, ATIS) were split into training, validation, and test sets. While CIFAR-10 has a standard split, the paper does not confirm its usage or provide details for ATIS.
Hardware Specification | Yes | The training of the models is executed on 16 NVIDIA TITAN Xp GPUs. The workers are connected by Gigabit Ethernet. We use each GPU as a worker.
Software Dependencies | Yes | We simulate packet losses by adapting the latest version 2.5 of the Microsoft Cognitive Toolkit (Seide & Agarwal, 2016). We implement the RPS algorithm using MPI.
Experiment Setup | Yes | During training, we use a local batch size of 32 samples per worker for image classification. We adjust the learning rate by applying a linear scaling rule (Goyal et al., 2017) with a decay of 10 percent after epochs 80 and 120. To achieve the best possible convergence, we apply a gradual warmup strategy (Goyal et al., 2017) during the first 5 epochs. We deliberately do not use any regularization or momentum during the experiments in order to be consistent with the described algorithm and proof. (A sketch of this learning-rate schedule follows the table.)
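
The Research Type and Pseudocode rows above summarize the core RPS idea: the model is partitioned into blocks handled by multiple (logical) parameter servers, and gradient packets exchanged between workers and servers may be dropped. The snippet below is a minimal NumPy sketch of that aggregation step, not the authors' MPI implementation: the function name rps_style_aggregate, the even block partitioning, and the zero-update fallback for a block whose packets are all lost are illustrative assumptions and may differ from the paper's Algorithm 1.

```python
# Toy simulation (assumed, not the authors' code): gradient aggregation
# when each worker->server packet is dropped i.i.d. with probability
# drop_prob. The model is split into num_blocks blocks, one per logical
# parameter server; each server averages only the blocks it receives.
import numpy as np

def rps_style_aggregate(grads, num_blocks, drop_prob, rng):
    """grads: (num_workers, dim) per-worker gradients.
    Returns a gradient assembled from per-block averages of the
    packets that actually arrived."""
    num_workers, dim = grads.shape
    blocks = np.array_split(np.arange(dim), num_blocks)
    agg = np.zeros(dim)
    for block in blocks:
        # Each worker's packet for this block arrives independently
        # with probability 1 - drop_prob.
        arrived = rng.random(num_workers) > drop_prob
        if arrived.any():
            agg[block] = grads[arrived][:, block].mean(axis=0)
        # If every packet for a block is lost, this sketch simply leaves
        # a zero update for that block (a simplification).
    return agg

rng = np.random.default_rng(0)
num_workers, dim = 16, 1024
grads = rng.normal(size=(num_workers, dim))
exact = grads.mean(axis=0)
for p in (0.0, 0.01, 0.1):
    approx = rps_style_aggregate(grads, num_blocks=8, drop_prob=p, rng=rng)
    err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
    print(f"drop_prob={p:4.2f}  relative aggregation error={err:.3f}")
```

Because drops are independent of the gradient values, averaging whichever blocks arrive still centers on the true average; this is the intuition the paper's convergence analysis formalizes.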
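
The Experiment Setup row quotes a linear scaling rule, a 5-epoch gradual warmup, and decays after epochs 80 and 120. The following is a minimal sketch of such a schedule, assuming a base learning rate of 0.1 at a reference batch size of 256 (the defaults in Goyal et al., 2017, not stated in the excerpt) and interpreting the decay as multiplying the rate by 0.1 at each milestone.

```python
# Illustrative learning-rate schedule (assumed defaults, not the
# authors' code): linear scaling with the global batch size, gradual
# warmup over the first 5 epochs, and step decay after epochs 80/120.

def learning_rate(epoch, num_workers=16, local_batch=32,
                  base_lr=0.1, ref_batch=256, warmup_epochs=5,
                  milestones=(80, 120), decay_factor=0.1):
    # Linear scaling rule: scale the base rate with the global batch size.
    target_lr = base_lr * (num_workers * local_batch) / ref_batch
    # Gradual warmup: ramp linearly up to the target rate.
    if epoch < warmup_epochs:
        return target_lr * (epoch + 1) / warmup_epochs
    # Step decay at each milestone epoch.
    lr = target_lr
    for m in milestones:
        if epoch >= m:
            lr *= decay_factor
    return lr

for e in (0, 4, 5, 79, 80, 119, 120):
    print(f"epoch {e:3d}: lr = {learning_rate(e):.5f}")
```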