BASGD: Buffered Asynchronous SGD for Byzantine Learning

Authors: Yi-Rui Yang, Wu-Jun Li

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results show that BASGD significantly outperforms vanilla asynchronous stochastic gradient descent (ASGD) and other ABL baselines when there are failures or attacks on workers. In this section, we empirically evaluate the performance of BASGD and baselines in both image classification (IC) and natural language processing (NLP) applications.
Researcher Affiliation | Academia | National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, China.
Pseudocode | Yes | Algorithm 1: Buffered Asynchronous SGD (BASGD). (A hedged sketch of the buffered update step follows the table.)
Open Source Code | No | No statement or link providing access to open-source code for the described methodology was found.
Open Datasets | Yes | In the IC experiment, the algorithms are evaluated on CIFAR-10 (Krizhevsky et al., 2009) with the deep learning model ResNet-20 (He et al., 2016). In the NLP experiment, the algorithms are evaluated on the WikiText-2 dataset with LSTM (Hochreiter & Schmidhuber, 1997) networks.
Dataset Splits | No | The paper mentions that the 'training set is randomly and equally distributed to different workers' and that a 'test set' is used, but it does not explicitly provide training/validation/test split details (e.g., percentages, sample counts, or a splitting methodology). (A sketch of one possible equal partition across workers follows the table.)
Hardware Specification | Yes | Our experiments are conducted on a distributed platform with dockers. Each docker is bound to an NVIDIA Tesla V100 (32G) GPU (in IC) or an NVIDIA Tesla K80 GPU (in NLP).
Software Dependencies | Yes | All algorithms are implemented with PyTorch 1.3.
Experiment Setup | Yes | The learning rate η is set to 0.1 initially for each algorithm, and multiplied by 0.1 at the 80th epoch and the 120th epoch, respectively. The weight decay is set to 10^-4. We run each algorithm for 160 epochs. Batch size is set to 25. Word embedding size is set to 100, and sequence length is set to 35. Gradient clipping size is set to 0.25. Cross-entropy is used as the loss function. We run each algorithm for 40 epochs. The initial learning rate η is chosen from {1, 2, 5, 10, 20}, and is divided by 4 every 10 epochs. (This excerpt combines the IC settings, which use 160 epochs, with the NLP settings, which use 40 epochs; a sketch of the 160-epoch learning-rate schedule in PyTorch follows the table.)
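
The Pseudocode row above points to Algorithm 1 (BASGD) in the paper. Below is a minimal, hedged sketch of the buffering idea described there: the server keeps B buffers, maps each arriving worker gradient to a buffer, and only takes an SGD step once every buffer is non-empty, aggregating the per-buffer averages and then clearing the buffers. The class name `BASGDServer`, the `worker_id % B` mapping, and the use of a coordinate-wise median as the aggregation rule are illustrative assumptions; the paper reports no open-source code, so this is not the authors' implementation.

# Hypothetical sketch of BASGD's server-side buffering (not the authors' code).
# Assumptions: B buffers, worker s mapped to buffer s mod B, and coordinate-wise
# median over buffer averages as an illustrative robust aggregation rule.
import torch


class BASGDServer:
    def __init__(self, model_params, num_buffers, lr):
        self.params = model_params          # flat parameter tensor held by the server
        self.B = num_buffers
        self.lr = lr
        # Each buffer keeps a running sum of received gradients and a count.
        self.buffer_sum = [torch.zeros_like(model_params) for _ in range(self.B)]
        self.buffer_cnt = [0] * self.B

    def receive(self, worker_id, grad):
        """Called whenever any worker sends a (possibly stale) gradient."""
        b = worker_id % self.B              # fixed worker-to-buffer mapping (assumed)
        self.buffer_sum[b] += grad
        self.buffer_cnt[b] += 1

        # Update the model only once every buffer has received at least one gradient.
        if all(c > 0 for c in self.buffer_cnt):
            buffer_means = torch.stack(
                [s / c for s, c in zip(self.buffer_sum, self.buffer_cnt)]
            )
            # Illustrative robust aggregation over the B buffer averages.
            agg_grad = buffer_means.median(dim=0).values
            self.params -= self.lr * agg_grad

            # Clear (zero out) all buffers after the update.
            for b in range(self.B):
                self.buffer_sum[b].zero_()
                self.buffer_cnt[b] = 0
        return self.params

With B = 1 every update simply averages all gradients received since the last step, which is close to vanilla ASGD; the robustness discussed in the paper comes from choosing B and the aggregation rule appropriately.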
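
The Open Datasets and Dataset Splits rows note that the IC experiment uses CIFAR-10 and that the training set is 'randomly and equally distributed to different workers', with no further split details. The following is a minimal sketch, assuming torchvision's CIFAR-10 loader and torch.utils.data.random_split, of one way such an equal random partition could be built; the worker count, seed, batch size, and variable names are illustrative choices, not values confirmed by the paper.

# Hypothetical sketch: randomly and equally partition the CIFAR-10 training set
# across workers (the paper does not specify its partitioning code).
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

NUM_WORKERS = 8      # illustrative worker count
BATCH_SIZE = 25      # illustrative; 25 is the batch size quoted in the table above

torch.manual_seed(0)  # make the random partition reproducible

transform = transforms.ToTensor()
train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True,
                            transform=transform)

# Split the 50,000 training images into NUM_WORKERS equal random shards.
shard_size = len(train_set) // NUM_WORKERS
sizes = [shard_size] * NUM_WORKERS
sizes[-1] += len(train_set) - sum(sizes)   # absorb any remainder in the last shard
shards = random_split(train_set, sizes)

worker_loaders = [DataLoader(shard, batch_size=BATCH_SIZE, shuffle=True)
                  for shard in shards]
test_loader = DataLoader(test_set, batch_size=BATCH_SIZE, shuffle=False)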
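
The Experiment Setup row reports, for the IC experiment, an initial learning rate of 0.1 decayed by a factor of 0.1 at epochs 80 and 120, a weight decay of 10^-4, and 160 epochs of training. A minimal sketch of how that schedule could be written with PyTorch's SGD optimizer and MultiStepLR scheduler is shown below; the `build_resnet20` placeholder, the loss choice shown here, and the omission of momentum are assumptions, since the excerpt does not specify them.

# Hypothetical sketch of the reported IC training schedule (not the authors' code):
# lr = 0.1, multiplied by 0.1 at epochs 80 and 120, weight decay 1e-4, 160 epochs.
import torch
import torch.nn as nn


def build_resnet20():
    # Placeholder stand-in for a CIFAR-10 ResNet-20 implementation.
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))


model = build_resnet20()
criterion = nn.CrossEntropyLoss()          # standard choice for classification
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80, 120], gamma=0.1)

for epoch in range(160):
    # for inputs, targets in worker_loader:   # per-epoch training loop elided
    #     optimizer.zero_grad()
    #     loss = criterion(model(inputs), targets)
    #     loss.backward()
    #     optimizer.step()
    scheduler.step()                          # decay the learning rate per epoch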