DRACO: Byzantine-resilient Distributed Training via Redundant Gradients
Authors: Lingjiao Chen, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide extensive experiments on real datasets and distributed setups across a variety of large-scale models, where we show that DRACO is several times to orders of magnitude faster than median-based approaches. |
| Researcher Affiliation | Academia | University of Wisconsin-Madison. Correspondence to: Lingjiao Chen <lchen@cs.wisc.edu>. |
| Pseudocode | Yes | Algorithm 1 Decoder Function DCyc (see the majority-vote sketch below the table). |
| Open Source Code | Yes | We implement DRACO in PyTorch and deploy it on distributed setups on Amazon EC2... https://github.com/hwang595/Draco |
| Open Datasets | Yes | The datasets and their associated learning models are summarized in Table 1. We use fully connected (FC) neural networks and LeNet (LeCun et al., 1998) for MNIST, ResNet-18 (He et al., 2016) for CIFAR-10 (Krizhevsky & Hinton, 2009), and the CNN-rand-non-static (CRN) model of (Kim, 2014) for Movie Review (MR) (Pang & Lee, 2005). |
| Dataset Splits | No | The paper does not explicitly provide specific training/validation/test dataset splits (e.g., percentages or exact counts) for reproducibility, beyond mentioning batch size and total iterations. While it uses standard datasets, the specific splits are not detailed. |
| Hardware Specification | Yes | We implement DRACO in PyTorch and deploy it on distributed setups on Amazon EC2... The experiments provided here are run on 46 real instances (45 compute nodes with 1 PS) on AWS EC2. For ResNet-152 and VGG-19, m4.4xlarge instances (16 cores, 64 GB memory) are used, while AlexNet experiments are run on m4.10xlarge instances (40 cores, 160 GB memory). |
| Software Dependencies | No | We have implemented all of these in PyTorch (Paszke et al., 2017b) with MPI4py (Dalcin et al., 2011) deployed on the m4.2/4/10xlarge instances in Amazon EC2. The paper mentions PyTorch and MPI4py, but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | The datasets and their associated learning models are summarized in Table 1. ...Learning Rate 0.01 / 0.01, 0.1, 0.001 and Batch Size 720 / 720, 180, 180 (FC / LeNet, ResNet-18, CRN, respectively). All three methods are trained for 10,000 distributed iterations. ...At each iteration, we randomly select s = 1, 3, 5 (2.2%, 6.7%, 11.1% of all compute nodes) nodes as adversaries (see the adversary-simulation sketch below the table). |
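
As a companion to the Pseudocode row, here is a minimal illustrative sketch of the redundancy idea DRACO's decoders build on: when every gradient is evaluated redundantly by at least 2s+1 nodes, a plurality vote at the parameter server recovers the honest value despite up to s Byzantine copies. This is not the paper's Algorithm 1 (the cyclic-code decoder DCyc); the function name and exact-match grouping below are assumptions made for illustration only.

```python
import numpy as np

def majority_vote_decode(copies, s):
    """Recover a gradient from >= 2s+1 redundant copies, at most s of
    which may be adversarial. Honest nodes compute the same deterministic
    gradient, so exact (byte-level) agreement is used for grouping."""
    assert len(copies) >= 2 * s + 1, "need at least 2s+1 redundant copies"
    counts = {}
    for g in copies:
        key = g.tobytes()                    # exact-match grouping key
        cnt, _ = counts.get(key, (0, g))
        counts[key] = (cnt + 1, g)
    # The honest gradient appears at least s+1 times, while any single
    # adversarial value appears at most s times, so the plurality wins.
    _, winner = max(counts.values(), key=lambda cv: cv[0])
    return winner

# toy check: s = 1 adversary among 2*1+1 = 3 redundant copies
honest = np.array([0.5, -1.0, 2.0])
bad = -100.0 * honest                        # sign-flipped, scaled corruption
assert np.array_equal(majority_vote_decode([honest, honest, bad], s=1), honest)
```

The paper's repetition-code and cyclic-code constructions provide this same black-box guarantee with structured gradient assignments and their own decoding procedures; the sketch only conveys the redundancy-plus-voting principle.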
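
For the Experiment Setup row, the following hedged sketch shows how the per-iteration adversary selection quoted above (randomly choosing s of the 45 compute nodes at each iteration) could be simulated. The helper name, the specific corruption (a scaled, sign-flipped gradient), and the constants are illustrative assumptions, not the released DRACO code.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_NODES = 45   # compute nodes in the paper's EC2 deployment (plus 1 PS)
S = 3            # adversaries per iteration; the paper sweeps s = 1, 3, 5

def corrupt_gradients(gradients, s=S, scale=-100.0):
    """Hypothetical helper: each iteration, s randomly chosen nodes replace
    their true gradient with a scaled, sign-flipped version before sending
    it to the parameter server."""
    adversaries = rng.choice(len(gradients), size=s, replace=False)
    corrupted = list(gradients)
    for i in adversaries:
        corrupted[i] = scale * gradients[i]
    return corrupted, set(adversaries)

# usage: 45 honest gradients of a 10-parameter model, 3 corrupted per iteration
grads = [np.ones(10) for _ in range(NUM_NODES)]
sent, byzantine = corrupt_gradients(grads)
print(f"{len(byzantine)} of {NUM_NODES} nodes were adversarial this iteration")
```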