Birder: Communication-Efficient 1-bit Adaptive Optimizer for Practical Distributed DNN Training

Authors: Hanyang Peng, Shuang Qin, Yue Yu, Jin Wang, Hui Wang, Ge Li

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments, conducted on 8 to 64 GPUs (1 to 8 nodes) using DDP, demonstrate that Birder achieves comparable inference performance to uncompressed SGDM/Adam, with up to 2.5× speedup for training ResNet-50 and 6.3× speedup for training BERT-Base.
Researcher Affiliation | Academia | Hanyang Peng (1), Shuang Qin (1), Yue Yu (1), Jin Wang (1), Hui Wang (1), Ge Li (2); (1) Peng Cheng Laboratory, Shenzhen, China; (2) School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen, China
Pseudocode | Yes | Algorithm 1: Birder
Open Source Code | Yes | Code is publicly available at https://openi.pcl.ac.cn/c2net_optim/Birder.
Open Datasets | Yes | For the experiments over ResNet-50, we evaluate the convergence and performance of SGDM, 1-bit Adam and Birder on ILSVRC2012... For the experiments over BERT-Base, we assess the convergence and performance of BertAdam (baseline), 1-bit Adam and Birder for the SQuAD 1.1 fine-tuning task using a pre-trained BERT-Base model checkpoint from Hugging Face.
Dataset Splits | No | The paper uses well-known datasets (ILSVRC2012, SQuAD 1.1, CIFAR-100, Penn Treebank, Wikipedia) that have standard validation splits, but it does not explicitly state the details of a validation split (e.g., percentages, sample counts, or how it was used) for its experiments.
Hardware Specification | Yes | Our experiments were conducted on a testbed consisting of 1, 2, 4, 8 nodes interconnected via 10 Gbps Ethernet. Each node was equipped with 8 Nvidia Tesla A100-80GB GPUs.
Software Dependencies | Yes | PyTorch 1.11.0 was used as the primary framework, accompanied by CUDA 11.6, cuDNN 8.2, and NCCL 2.10.3.
Experiment Setup | Yes | For the experiments over ResNet-50, we evaluate the convergence and performance of SGDM, 1-bit Adam and Birder on ILSVRC2012. The batch size per GPU is set to 32 or 128... When employing SGDM (baseline), the learning rate starts at 0.1 × batch size / 256 with momentum of 0.9 and weight decay of 0.0001. When employing 1-bit Adam and Birder, the learning rate starts at 0.001 × batch size / 256 with weight decay of 0.0001; [β1, β2] for 1-bit Adam is set to [0.9, 0.999] and β for Birder is set to 0.95. Then, the learning rate is divided by 10 after 30, 60 and 90 epochs, and training is finally terminated after 100 epochs.
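
To make the quoted baseline schedule concrete, the following is a minimal PyTorch sketch of the ResNet-50 SGDM configuration described in the Experiment Setup row (linearly scaled learning rate, step decay at epochs 30/60/90, 100 epochs total). It is assembled from the reported hyperparameters, not taken from the authors' released code; the variable names, the use of `MultiStepLR`, and the single-node world size are our assumptions, and the Birder and 1-bit Adam optimizers themselves are not reproduced here (their implementation lives in the linked repository).

```python
# Hedged sketch of the ResNet-50 SGDM baseline schedule described in the paper,
# using standard PyTorch / torchvision APIs. Not the authors' code.
import torch
import torchvision

batch_size_per_gpu = 32          # the paper uses 32 or 128 per GPU
world_size = 8                   # assumption: one node with 8 A100 GPUs
global_batch = batch_size_per_gpu * world_size

model = torchvision.models.resnet50()

# SGDM baseline: lr = 0.1 * (global batch size / 256), momentum 0.9, weight decay 1e-4
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1 * global_batch / 256,
    momentum=0.9,
    weight_decay=1e-4,
)

# Learning rate divided by 10 after epochs 30, 60 and 90; training stops at epoch 100
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60, 90], gamma=0.1
)

for epoch in range(100):
    # ... one epoch of (DDP) training over ILSVRC2012 would go here ...
    scheduler.step()
```

The 1-bit Adam and Birder runs would use the same step schedule but start from 0.001 × batch size / 256, per the quoted setup.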