Election Coding for Distributed Learning: Protecting SignSGD against Byzantine Attacks

Authors: Jy-yong Sohn, Dong-Jun Han, Beongjun Choi, Jaekyun Moon

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on real datasets confirm that the suggested codes provide substantial improvement in Byzantine tolerance of distributed learning systems employing SignSGD. We implement the suggested coded distributed learning algorithms in PyTorch, and deploy them on Amazon EC2 using Python with the MPI4py package. We trained ResNet-18 on the CIFAR-10 dataset as well as a logistic regression model on the Amazon Employee Access dataset.
Researcher Affiliation | Academia | Jy-yong Sohn (jysohn1108@kaist.ac.kr), Dong-Jun Han (djhan93@kaist.ac.kr), Beongjun Choi (bbzang10@kaist.ac.kr), Jaekyun Moon (jmoon@kaist.edu), School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST)
Pseudocode | Yes | Algorithm 1: Data allocation matrix G satisfying perfect b-Byzantine tolerance (0 < b < n/2)
Open Source Code | No | The paper states: "We implement the suggested coded distributed learning algorithms in PyTorch, and deploy them on Amazon EC2 using Python with the MPI4py package." However, it does not provide a link or state that the code developed for this paper is open-source or publicly available.
Open Datasets | Yes | We trained ResNet-18 on the CIFAR-10 dataset as well as a logistic regression model on the Amazon Employee Access dataset.
Dataset Splits | No | The paper mentions "ntrain = 50000 and ntest = 10000" for CIFAR-10 and "number of training data q = 26325" for Amazon Employee Access, but it does not specify validation splits or split percentages for any dataset.
Hardware Specification | Yes | Our experiments are simulated on g4dn.xlarge instances (having a GPU) for both workers and the master. We used c4.large instances for the n workers that compute batch gradients, and a single c4.2xlarge instance for the master that aggregates the gradients from workers and determines the model updating rule.
Software Dependencies | No | We implement the suggested coded distributed learning algorithms in PyTorch, and deploy them on Amazon EC2 using Python with the MPI4py package. The paper names this software but does not provide specific version numbers for PyTorch, Python, or MPI4py.
Experiment Setup | Yes | Similar to the simulation settings in the previous works [5, 6], we used the momentum counterpart SIGNUM instead of SIGNSGD for fast convergence, with a learning rate of γ = 0.0001 and a momentum term of η = 0.9. We used stochastic mini-batch gradient descent with batch size B.
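The Signum-with-majority-vote procedure referenced in the setup above can be sketched as follows. This is a minimal illustration assuming the standard Signum update (momentum-smoothed gradient, sign-only communication, coordinate-wise majority vote at the master) with the paper's reported hyperparameters; the function and variable names are illustrative, not taken from the authors' code, and no Byzantine coding is included.

```python
import numpy as np

GAMMA = 1e-4  # learning rate gamma = 0.0001, as reported in the paper
ETA = 0.9     # momentum term eta = 0.9, as reported in the paper

def worker_sign(grad, momentum_buf):
    """Worker side: smooth the stochastic gradient with momentum and
    send only the element-wise sign to the master (Signum)."""
    momentum_buf = ETA * momentum_buf + (1.0 - ETA) * grad
    return np.sign(momentum_buf), momentum_buf

def master_update(params, worker_signs):
    """Master side: aggregate the workers' sign vectors by
    coordinate-wise majority vote, then take a sign-based step."""
    vote = np.sign(np.sum(worker_signs, axis=0))
    return params - GAMMA * vote
```

With n workers, the master receives n sign vectors per iteration; the majority vote makes the update depend only on how many workers agree per coordinate, which is the property the paper's election coding is designed to protect against Byzantine sign flips.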