BASGD: Buffered Asynchronous SGD for Byzantine Learning

Authors: Yi-Rui Yang, Wu-Jun Li

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results show that BASGD significantly outperforms vanilla asynchronous stochastic gradient descent (ASGD) and other ABL baselines when there are failures or attacks on workers. In this section, we empirically evaluate the performance of BASGD and baselines in both image classification (IC) and natural language processing (NLP) applications.
Researcher Affiliation | Academia | National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, China.
Pseudocode | Yes | Algorithm 1: Buffered Asynchronous SGD (BASGD). (A hedged sketch of the buffered update step follows the table.)
Open Source Code | No | No statement or link providing access to open-source code for the described methodology was found.
Open Datasets | Yes | In the IC experiment, the algorithms are evaluated on CIFAR-10 (Krizhevsky et al., 2009) with the deep learning model ResNet-20 (He et al., 2016). In the NLP experiment, the algorithms are evaluated on the WikiText-2 dataset with LSTM (Hochreiter & Schmidhuber, 1997) networks.
Dataset Splits | No | The paper mentions that the 'training set is randomly and equally distributed to different workers' and that a 'test set' is used, but it does not explicitly provide training/validation/test split details (e.g., percentages, sample counts, or a splitting methodology). (A sketch of one possible equal partition across workers follows the table.)
Hardware Specification | Yes | Our experiments are conducted on a distributed platform with dockers. Each docker is bound to an NVIDIA Tesla V100 (32G) GPU (in IC) or an NVIDIA Tesla K80 GPU (in NLP).
Software Dependencies | Yes | All algorithms are implemented with PyTorch 1.3.
Experiment Setup | Yes | The learning rate η is set to 0.1 initially for each algorithm, and multiplied by 0.1 at the 80th epoch and the 120th epoch, respectively. The weight decay is set to 10^-4. We run each algorithm for 160 epochs. Batch size is set to 25. Word embedding size is set to 100, and sequence length is set to 35. Gradient clipping size is set to 0.25. Cross-entropy is used as the loss function. We run each algorithm for 40 epochs. The initial learning rate η is chosen from {1, 2, 5, 10, 20}, and is divided by 4 every 10 epochs. (This excerpt combines the IC settings, which use 160 epochs, with the NLP settings, which use 40 epochs; a sketch of the 160-epoch learning-rate schedule in PyTorch follows the table.)
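
The Pseudocode row above points to Algorithm 1 (BASGD) in the paper. Below is a minimal, hedged sketch of the buffering idea described there: the server keeps B buffers, maps each arriving worker gradient to a buffer, and only takes an SGD step once every buffer is non-empty, aggregating the per-buffer averages and then clearing the buffers. The class name `BASGDServer`, the `worker_id % B` mapping, and the use of a coordinate-wise median as the aggregation rule are illustrative assumptions; the paper reports no open-source code, so this is not the authors' implementation.

# Hypothetical sketch of BASGD's server-side buffering (not the authors' code).
# Assumptions: B buffers, worker s mapped to buffer s mod B, and coordinate-wise
# median over buffer averages as an illustrative robust aggregation rule.
import torch


class BASGDServer:
    def __init__(self, model_params, num_buffers, lr):
        self.params = model_params          # flat parameter tensor held by the server
        self.B = num_buffers
        self.lr = lr
        # Each buffer keeps a running sum of received gradients and a count.
        self.buffer_sum = [torch.zeros_like(model_params) for _ in range(self.B)]
        self.buffer_cnt = [0] * self.B

    def receive(self, worker_id, grad):
        """Called whenever any worker sends a (possibly stale) gradient."""
        b = worker_id % self.B              # fixed worker-to-buffer mapping (assumed)
        self.buffer_sum[b] += grad
        self.buffer_cnt[b] += 1

        # Update the model only once every buffer has received at least one gradient.
        if all(c > 0 for c in self.buffer_cnt):
            buffer_means = torch.stack(
                [s / c for s, c in zip(self.buffer_sum, self.buffer_cnt)]
            )
            # Illustrative robust aggregation over the B buffer averages.
            agg_grad = buffer_means.median(dim=0).values
            self.params -= self.lr * agg_grad

            # Clear (zero out) all buffers after the update.
            for b in range(self.B):
                self.buffer_sum[b].zero_()
                self.buffer_cnt[b] = 0
        return self.params

With B = 1 every update simply averages all gradients received since the last step, which is close to vanilla ASGD; the robustness discussed in the paper comes from choosing B and the aggregation rule appropriately.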
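
The Open Datasets and Dataset Splits rows note that the IC experiment uses CIFAR-10 and that the training set is 'randomly and equally distributed to different workers', with no further split details. The following is a minimal sketch, assuming torchvision's CIFAR-10 loader and torch.utils.data.random_split, of one way such an equal random partition could be built; the worker count, seed, batch size, and variable names are illustrative choices, not values confirmed by the paper.

# Hypothetical sketch: randomly and equally partition the CIFAR-10 training set
# across workers (the paper does not specify its partitioning code).
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

NUM_WORKERS = 8      # illustrative worker count
BATCH_SIZE = 25      # illustrative; 25 is the batch size quoted in the table above

torch.manual_seed(0)  # make the random partition reproducible

transform = transforms.ToTensor()
train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True,
                            transform=transform)

# Split the 50,000 training images into NUM_WORKERS equal random shards.
shard_size = len(train_set) // NUM_WORKERS
sizes = [shard_size] * NUM_WORKERS
sizes[-1] += len(train_set) - sum(sizes)   # absorb any remainder in the last shard
shards = random_split(train_set, sizes)

worker_loaders = [DataLoader(shard, batch_size=BATCH_SIZE, shuffle=True)
                  for shard in shards]
test_loader = DataLoader(test_set, batch_size=BATCH_SIZE, shuffle=False)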
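
The Experiment Setup row reports, for the IC experiment, an initial learning rate of 0.1 decayed by a factor of 0.1 at epochs 80 and 120, a weight decay of 10^-4, and 160 epochs of training. A minimal sketch of how that schedule could be written with PyTorch's SGD optimizer and MultiStepLR scheduler is shown below; the `build_resnet20` placeholder, the loss choice shown here, and the omission of momentum are assumptions, since the excerpt does not specify them.

# Hypothetical sketch of the reported IC training schedule (not the authors' code):
# lr = 0.1, multiplied by 0.1 at epochs 80 and 120, weight decay 1e-4, 160 epochs.
import torch
import torch.nn as nn


def build_resnet20():
    # Placeholder stand-in for a CIFAR-10 ResNet-20 implementation.
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))


model = build_resnet20()
criterion = nn.CrossEntropyLoss()          # standard choice for classification
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80, 120], gamma=0.1)

for epoch in range(160):
    # for inputs, targets in worker_loader:   # per-epoch training loop elided
    #     optimizer.zero_grad()
    #     loss = criterion(model(inputs), targets)
    #     loss.backward()
    #     optimizer.step()
    scheduler.step()                          # decay the learning rate per epoch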