Birder: Communication-Efficient 1-bit Adaptive Optimizer for Practical Distributed DNN Training

Authors: Hanyang Peng, Shuang Qin, Yue Yu, Jin Wang, Hui Wang, Ge Li

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments, conducted on 8 to 64 GPUs (1 to 8 nodes) using DDP, demonstrate that Birder achieves comparable inference performance to uncompressed SGDM/Adam, with up to 2.5× speedup for training ResNet-50 and 6.3× speedup for training BERT-Base.
Researcher Affiliation | Academia | Hanyang Peng (1), Shuang Qin (1), Yue Yu (1), Jin Wang (1), Hui Wang (1), Ge Li (2); (1) Peng Cheng Laboratory, Shenzhen, China; (2) School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen, China
Pseudocode | Yes | Algorithm 1: Birder
Open Source Code | Yes | Code is publicly available at https://openi.pcl.ac.cn/c2net_optim/Birder.
Open Datasets | Yes | For the experiments over ResNet-50, we evaluate the convergence and performance of SGDM, 1-bit Adam and Birder on ILSVRC2012... For the experiments over BERT-Base, we assess the convergence and performance of BertAdam (baseline), 1-bit Adam and Birder for the SQuAD 1.1 fine-tuning task using a pre-trained BERT-Base model checkpoint from Hugging Face.
Dataset Splits | No | The paper uses well-known datasets (ILSVRC2012, SQuAD 1.1, CIFAR-100, Penn Treebank, Wikipedia) that have standard validation splits, but it does not explicitly state the details of a validation split (e.g., percentages, sample counts, or how it was used) for its experiments.
Hardware Specification | Yes | Our experiments were conducted on a testbed consisting of 1, 2, 4, 8 nodes interconnected via 10 Gbps Ethernet. Each node was equipped with 8 Nvidia Tesla A100-80GB GPUs.
Software Dependencies | Yes | PyTorch 1.11.0 was used as the primary framework, accompanied by CUDA 11.6, cuDNN 8.2, and NCCL 2.10.3.
Experiment Setup | Yes | For the experiments over ResNet-50, we evaluate the convergence and performance of SGDM, 1-bit Adam and Birder on ILSVRC2012. The batch size per GPU is set to 32 or 128... When employing SGDM (baseline), the learning rate starts at 0.1 × batch size / 256 with momentum of 0.9 and weight decay of 0.0001. When employing 1-bit Adam and Birder, the learning rate starts at 0.001 × batch size / 256 with weight decay of 0.0001; [β1, β2] for 1-bit Adam is set to [0.9, 0.999] and β for Birder is set to 0.95. Then, the learning rate is divided by 10 after 30, 60 and 90 epochs, and training is finally terminated after 100 epochs.
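
To make the quoted baseline schedule concrete, the following is a minimal PyTorch sketch of the ResNet-50 SGDM configuration described in the Experiment Setup row (linearly scaled learning rate, step decay at epochs 30/60/90, 100 epochs total). It is assembled from the reported hyperparameters, not taken from the authors' released code; the variable names, the use of `MultiStepLR`, and the single-node world size are our assumptions, and the Birder and 1-bit Adam optimizers themselves are not reproduced here (their implementation lives in the linked repository).

```python
# Hedged sketch of the ResNet-50 SGDM baseline schedule described in the paper,
# using standard PyTorch / torchvision APIs. Not the authors' code.
import torch
import torchvision

batch_size_per_gpu = 32          # the paper uses 32 or 128 per GPU
world_size = 8                   # assumption: one node with 8 A100 GPUs
global_batch = batch_size_per_gpu * world_size

model = torchvision.models.resnet50()

# SGDM baseline: lr = 0.1 * (global batch size / 256), momentum 0.9, weight decay 1e-4
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1 * global_batch / 256,
    momentum=0.9,
    weight_decay=1e-4,
)

# Learning rate divided by 10 after epochs 30, 60 and 90; training stops at epoch 100
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60, 90], gamma=0.1
)

for epoch in range(100):
    # ... one epoch of (DDP) training over ILSVRC2012 would go here ...
    scheduler.step()
```

The 1-bit Adam and Birder runs would use the same step schedule but start from 0.001 × batch size / 256, per the quoted setup.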