Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance

Authors: Cong Xie, Sanmi Koyejo, Indranil Gupta

ICML 2019

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experimental results show that Zeno outperforms existing approaches. In this section, we evaluate the fault tolerance of the proposed algorithm. |
| Researcher Affiliation | Academia | Department of Computer Science, University of Illinois, Urbana-Champaign, USA. Correspondence to: Cong Xie <cx2@illinois.edu>. |
| Pseudocode | Yes | Algorithm 1 Zeno (see the sketch after this table) |
| Open Source Code | Yes | The detailed network architecture can be found in https://github.com/xcgoner/icml2019_zeno. |
| Open Datasets | Yes | We conduct experiments on benchmark CIFAR-10 image classification dataset (Krizhevsky & Hinton, 2009). |
| Dataset Splits | No | The paper mentions "50k images for training and 10k images for testing" but does not specify a validation set or any split methodology beyond this training/testing division. |
| Hardware Specification | No | The paper states "In each experiment, we launch 20 worker processes" but gives no details on the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper does not list software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions) needed to replicate the experiments. |
| Experiment Setup | Yes | In all the experiments, we take the learning rate γ = 0.1, worker batch size 100, Zeno batch size n_r = 4, and ρ = 0.0005. Each epoch has 25 iterations. |
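For readers checking the Pseudocode and Experiment Setup rows, below is a minimal NumPy sketch of Zeno's suspicion-based aggregation (Algorithm 1): each candidate gradient g is scored by the stochastic descendant score f_r(x) − f_r(x − γg) − ρ‖g‖², where f_r is the loss estimated on a small Zeno batch of n_r samples, and the b lowest-scoring (most suspicious) gradients are discarded before averaging. The names `zeno_score`, `zeno_aggregate`, and `loss_fn`, and the toy quadratic loss and sign-flipping fault below, are illustrative assumptions rather than the authors' released code; only the hyperparameter values (γ = 0.1, ρ = 0.0005, n_r = 4, 20 workers) come from the paper's stated setup.

```python
import numpy as np

GAMMA = 0.1   # learning rate gamma = 0.1 (paper's setup)
RHO = 5e-4    # score regularization rho = 0.0005 (paper's setup)

def zeno_score(grad, x, loss_fn, gamma=GAMMA, rho=RHO):
    """Stochastic descendant score: higher means less suspicious.

    loss_fn stands in for the loss estimated on the n_r-sample Zeno batch
    (n_r = 4 in the paper's experiments).
    """
    return loss_fn(x) - loss_fn(x - gamma * grad) - rho * np.dot(grad, grad)

def zeno_aggregate(grads, x, loss_fn, b):
    """Drop the b lowest-scoring gradients, average the remaining n - b."""
    scores = np.array([zeno_score(g, x, loss_fn) for g in grads])
    keep = np.argsort(scores)[b:]  # indices of the n - b highest scores
    return np.mean([grads[i] for i in keep], axis=0)

# Toy check (hypothetical fault model): quadratic loss, 12 honest workers
# and 8 Byzantine workers that send negated gradients, out of 20 total.
rng = np.random.default_rng(0)
x = rng.normal(size=10)
loss_fn = lambda w: 0.5 * np.dot(w, w)        # true gradient is x itself
honest = [x + 0.01 * rng.normal(size=10) for _ in range(12)]
byzantine = [-g for g in honest[:8]]          # negated (faulty) gradients
update = zeno_aggregate(honest + byzantine, x, loss_fn, b=8)
x = x - GAMMA * update                        # one SGD step with Zeno
```

In this toy run, the negated gradients increase the loss when followed, so they receive negative descendant scores and are trimmed; with b set to at least the number of faulty workers, the averaged update stays close to the honest descent direction.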