BASGD: Buffered Asynchronous SGD for Byzantine Learning
Authors: Yi-Rui Yang, Wu-Jun Li
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show that BASGD significantly outperforms vanilla asynchronous stochastic gradient descent (ASGD) and other asynchronous Byzantine learning (ABL) baselines when there exist failures or attacks on workers. In this section, we empirically evaluate the performance of BASGD and baselines in both image classification (IC) and natural language processing (NLP) applications. |
| Researcher Affiliation | Academia | National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, China. |
| Pseudocode | Yes | Algorithm 1 Buffered Asynchronous SGD (BASGD); a hedged code sketch of the buffering logic is given below the table. |
| Open Source Code | No | No statement or link providing access to open-source code for the described methodology was found. |
| Open Datasets | Yes | In the IC experiment, the algorithms are evaluated on CIFAR10 (Krizhevsky et al., 2009) with the deep learning model ResNet-20 (He et al., 2016). In the NLP experiment, the algorithms are evaluated on the WikiText-2 dataset with LSTM (Hochreiter & Schmidhuber, 1997) networks. |
| Dataset Splits | No | The paper mentions that the 'training set is randomly and equally distributed to different workers' and uses a 'test set', but does not explicitly provide specific details about training, validation, or test dataset splits (e.g., percentages, sample counts, or explicit splitting methodology). A sketch of such a per-worker partition is given below the table. |
| Hardware Specification | Yes | Our experiments are conducted on a distributed platform with dockers. Each docker is bound to an NVIDIA Tesla V100 (32G) GPU (in IC) or an NVIDIA Tesla K80 GPU (in NLP). |
| Software Dependencies | Yes | All algorithms are implemented with PyTorch 1.3. |
| Experiment Setup | Yes | The learning rate η is set to 0.1 initially for each algorithm, and multiplied by 0.1 at the 80-th epoch and the 120-th epoch respectively. The weight decay is set to 10^-4. We run each algorithm for 160 epochs. Batch size is set to 25. In the NLP experiment, word embedding size is set to 100, and sequence length is set to 35. Gradient clipping size is set to 0.25. Cross-entropy is used as the loss function. We run each algorithm for 40 epochs. The initial learning rate η is chosen from {1, 2, 5, 10, 20}, and is divided by 4 every 10 epochs. A hedged sketch of the 160-epoch schedule is given below the table. |
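
The Pseudocode row above points to Algorithm 1 (BASGD). As a reading aid, below is a minimal single-process sketch of the server-side buffering logic that row refers to: each incoming gradient is mapped to one of B buffers (worker id modulo B) and averaged within that buffer, and an SGD step is taken only when every buffer is non-empty, after which all buffers are zeroed. The class and attribute names, and the plain coordinate-wise median used as the aggregation rule, are illustrative assumptions rather than the paper's exact implementation.

```python
import torch


class BASGDServer:
    """Minimal single-process sketch of the server-side buffering step in
    Algorithm 1 (BASGD). The buffer count B, the worker-to-buffer mapping
    (worker id mod B), and the zero-out after each update follow the paper;
    the class/attribute names and the coordinate-wise median aggregator are
    illustrative assumptions."""

    def __init__(self, params, num_buffers, lr):
        self.params = params                 # flattened model parameters (1-D tensor)
        self.B = num_buffers
        self.lr = lr
        self.buffers = [torch.zeros_like(params) for _ in range(num_buffers)]
        self.counts = [0] * num_buffers      # gradients currently stored in each buffer

    def receive(self, worker_id, grad):
        """Handle one (possibly stale) gradient pushed by any worker."""
        b = worker_id % self.B
        self.counts[b] += 1
        # Running average of all gradients mapped into buffer b.
        self.buffers[b] += (grad - self.buffers[b]) / self.counts[b]

        # Take an SGD step only when every buffer holds at least one gradient.
        if all(c > 0 for c in self.counts):
            stacked = torch.stack(self.buffers)        # shape: (B, dim)
            aggregated = stacked.median(dim=0).values  # robust aggregation (median here)
            self.params -= self.lr * aggregated
            for b in range(self.B):                    # zero out all buffers
                self.buffers[b].zero_()
                self.counts[b] = 0
        return self.params
```

In an actual deployment, the same logic would run on the parameter server, with workers asynchronously pushing gradients computed on their local data shards.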
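
The Dataset Splits row notes only that the training set is randomly and equally distributed to different workers. The sketch below shows one way such a partition could be produced for CIFAR10 with torchvision; the number of workers, the fixed seed, and the loader settings are hypothetical values not taken from the paper, and the batch size of 25 is simply reused from the Experiment Setup row for illustration.

```python
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

# Hypothetical settings: the paper does not report these loader details.
NUM_WORKERS = 8        # number of workers in the simulated cluster (assumed)
BATCH_SIZE = 25        # reused from the Experiment Setup row for illustration
SEED = 0               # fixed seed for a reproducible random partition (assumed)

train_set = datasets.CIFAR10(
    root="./data", train=True, download=True, transform=transforms.ToTensor()
)

# Randomly split the 50,000 training images into (nearly) equal shards.
shard_size = len(train_set) // NUM_WORKERS
lengths = [shard_size] * NUM_WORKERS
lengths[-1] += len(train_set) - sum(lengths)  # absorb any remainder
shards = random_split(train_set, lengths, generator=torch.Generator().manual_seed(SEED))

# One independent loader per worker, mimicking the per-worker data shards.
worker_loaders = [
    DataLoader(shard, batch_size=BATCH_SIZE, shuffle=True) for shard in shards
]
```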
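
The Experiment Setup row quotes a concrete schedule: initial learning rate 0.1 multiplied by 0.1 at the 80-th and 120-th epochs, weight decay 10^-4, and 160 epochs, presumably describing the CIFAR10/ResNet-20 run. A minimal PyTorch sketch of that schedule is shown below; the stand-in linear model and the use of MultiStepLR are assumptions for illustration, not the authors' code.

```python
import torch
from torch import nn, optim

# Stand-in model for illustration; the paper trains ResNet-20 on CIFAR10.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

# Settings quoted in the Experiment Setup row: initial lr 0.1, multiplied by
# 0.1 at the 80-th and 120-th epochs, weight decay 1e-4, 160 epochs.
# The use of MultiStepLR is an implementation assumption.
optimizer = optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)
criterion = nn.CrossEntropyLoss()  # standard classification loss, used inside the loop

for epoch in range(160):
    # ... one training pass over the distributed shards would go here ...
    scheduler.step()  # decay the learning rate after each epoch
```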