DRACO: Byzantine-resilient Distributed Training via Redundant Gradients
Authors: Lingjiao Chen, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide extensive experiments on real datasets and distributed setups across a variety of large-scale models, where we show that DRACO is several times to orders of magnitude faster than median-based approaches. |
| Researcher Affiliation | Academia | University of Wisconsin-Madison. Correspondence to: Lingjiao Chen <lchen@cs.wisc.edu>. |
| Pseudocode | Yes | Algorithm 1 Decoder Function DCyc (see the majority-vote sketch below the table). |
| Open Source Code | Yes | We implement DRACO in PyTorch and deploy it on distributed setups on Amazon EC2... https://github.com/hwang595/Draco |
| Open Datasets | Yes | The datasets and their associated learning models are summarized in Table 1. We use fully connected (FC) neural networks and LeNet (LeCun et al., 1998) for MNIST, ResNet-18 (He et al., 2016) for CIFAR-10 (Krizhevsky & Hinton, 2009), and the CNN-rand-non-static (CRN) model of (Kim, 2014) for Movie Review (MR) (Pang & Lee, 2005). |
| Dataset Splits | No | The paper does not explicitly provide specific training/validation/test dataset splits (e.g., percentages or exact counts) for reproducibility, beyond mentioning batch size and total iterations. While it uses standard datasets, the specific splits are not detailed. |
| Hardware Specification | Yes | We implement DRACO in PyTorch and deploy it on distributed setups on Amazon EC2... The experiments provided here are run on 46 real instances (45 compute nodes with 1 PS) on AWS EC2. For ResNet-152 and VGG-19, m4.4xlarge instances (16 cores, 64 GB memory) are used, while AlexNet experiments are run on m4.10xlarge instances (40 cores, 160 GB memory). |
| Software Dependencies | No | We have implemented all of these in PyTorch (Paszke et al., 2017b) with MPI4py (Dalcin et al., 2011) deployed on the m4.2/4/10xlarge instances in Amazon EC2. The paper mentions PyTorch and MPI4py, but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | The datasets and their associated learning models are summarized in Table 1. ...Learning Rate 0.01 / 0.01, 0.1, 0.001 and Batch Size 720 / 720, 180, 180 (FC / LeNet, ResNet-18, CRN, respectively). All three methods are trained for 10,000 distributed iterations. ...At each iteration, we randomly select s = 1, 3, 5 (2.2%, 6.7%, 11.1% of all compute nodes) nodes as adversaries (see the adversary-simulation sketch below the table). |
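
As a companion to the Pseudocode row, here is a minimal illustrative sketch of the redundancy idea DRACO's decoders build on: when every gradient is evaluated redundantly by at least 2s+1 nodes, a plurality vote at the parameter server recovers the honest value despite up to s Byzantine copies. This is not the paper's Algorithm 1 (the cyclic-code decoder DCyc); the function name and exact-match grouping below are assumptions made for illustration only.

```python
import numpy as np

def majority_vote_decode(copies, s):
    """Recover a gradient from >= 2s+1 redundant copies, at most s of
    which may be adversarial. Honest nodes compute the same deterministic
    gradient, so exact (byte-level) agreement is used for grouping."""
    assert len(copies) >= 2 * s + 1, "need at least 2s+1 redundant copies"
    counts = {}
    for g in copies:
        key = g.tobytes()                    # exact-match grouping key
        cnt, _ = counts.get(key, (0, g))
        counts[key] = (cnt + 1, g)
    # The honest gradient appears at least s+1 times, while any single
    # adversarial value appears at most s times, so the plurality wins.
    _, winner = max(counts.values(), key=lambda cv: cv[0])
    return winner

# toy check: s = 1 adversary among 2*1+1 = 3 redundant copies
honest = np.array([0.5, -1.0, 2.0])
bad = -100.0 * honest                        # sign-flipped, scaled corruption
assert np.array_equal(majority_vote_decode([honest, honest, bad], s=1), honest)
```

The paper's repetition-code and cyclic-code constructions provide this same black-box guarantee with structured gradient assignments and their own decoding procedures; the sketch only conveys the redundancy-plus-voting principle.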
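
For the Experiment Setup row, the following hedged sketch shows how the per-iteration adversary selection quoted above (randomly choosing s of the 45 compute nodes at each iteration) could be simulated. The helper name, the specific corruption (a scaled, sign-flipped gradient), and the constants are illustrative assumptions, not the released DRACO code.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_NODES = 45   # compute nodes in the paper's EC2 deployment (plus 1 PS)
S = 3            # adversaries per iteration; the paper sweeps s = 1, 3, 5

def corrupt_gradients(gradients, s=S, scale=-100.0):
    """Hypothetical helper: each iteration, s randomly chosen nodes replace
    their true gradient with a scaled, sign-flipped version before sending
    it to the parameter server."""
    adversaries = rng.choice(len(gradients), size=s, replace=False)
    corrupted = list(gradients)
    for i in adversaries:
        corrupted[i] = scale * gradients[i]
    return corrupted, set(adversaries)

# usage: 45 honest gradients of a 10-parameter model, 3 corrupted per iteration
grads = [np.ones(10) for _ in range(NUM_NODES)]
sent, byzantine = corrupt_gradients(grads)
print(f"{len(byzantine)} of {NUM_NODES} nodes were adversarial this iteration")
```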