Birder: Communication-Efficient 1-bit Adaptive Optimizer for Practical Distributed DNN Training
Authors: Hanyang Peng, Shuang Qin, Yue Yu, Jin Wang, Hui Wang, Ge Li
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments, conducted on 8 to 64 GPUs (1 to 8 nodes) using DDP, demonstrate that Birder achieves comparable inference performance to uncompressed SGDM/Adam, with up to 2.5× speedup for training ResNet-50 and 6.3× speedup for training BERT-Base. (A minimal DDP launch sketch follows the table.) |
| Researcher Affiliation | Academia | Hanyang Peng¹, Shuang Qin¹, Yue Yu¹, Jin Wang¹, Hui Wang¹, Ge Li². ¹Peng Cheng Laboratory, Shenzhen, China; ²School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen, China |
| Pseudocode | Yes | Algorithm 1: Birder. (The algorithm itself is not reproduced here; a generic 1-bit compression sketch follows the table.) |
| Open Source Code | Yes | Code is publicly available at https://openi.pcl.ac.cn/c2net_optim/Birder. |
| Open Datasets | Yes | For the experiments over ResNet-50, we evaluate the convergence and performance of SGDM, 1-bit Adam and Birder on ILSVRC2012... For the experiments over BERT-Base, we assess the convergence and performance of BertAdam (baseline), 1-bit Adam and Birder on the SQuAD 1.1 fine-tuning task, using a pre-trained BERT-Base model checkpoint from Hugging Face. |
| Dataset Splits | No | The paper uses well-known datasets (ILSVRC2012, SQuAD 1.1, CIFAR100, Penn Treebank, Wikipedia) which often have standard validation splits, but it does not explicitly state the details of a validation split (e.g., percentages, sample counts, or how it was used) for its experiments. |
| Hardware Specification | Yes | Our experiments were conducted on a testbed consisting of 1, 2, 4, 8 nodes interconnected via 10Gbps Ethernet. Each node was equipped with 8 Nvidia Tesla A100-80GB GPUs. |
| Software Dependencies | Yes | PyTorch 1.11.0 was used as the primary framework, together with CUDA 11.6, cuDNN 8.2, and NCCL 2.10.3. |
| Experiment Setup | Yes | For the experiments over ResNet-50, we evaluate the convergence and performance of SGDM, 1-bit Adam and Birder on ILSVRC2012. The batch size per GPU is set to 32 or 128... When employing SGDM (baseline), the learning rate starts at 0.1 × batch size / 256 with momentum of 0.9 and weight decay of 0.0001. When employing 1-bit Adam and Birder, the learning rate starts at 0.001 × batch size / 256 with weight decay of 0.0001; [β1, β2] for 1-bit Adam is set to [0.9, 0.999] and β for Birder is set to 0.95. The learning rate is then divided by 10 after 30, 60 and 90 epochs, and training terminates after 100 epochs. (A PyTorch sketch of this recipe follows the table.) |
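The "Research Type" row notes that experiments ran with PyTorch DDP on 8 to 64 GPUs. Below is a minimal, hedged sketch of that launch pattern; the script structure and training-loop contents are placeholders, not taken from the released code (launched with e.g. `torchrun --nproc_per_node=8 train.py`).

```python
import os
import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; LOCAL_RANK is set by torchrun.
    dist.init_process_group(backend="nccl")  # NCCL backend, matching the reported stack
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torchvision.models.resnet50().cuda(local_rank)
    # Plain DDP all-reduces full-precision gradients each step; 1-bit Adam and
    # Birder target exactly this communication cost by compressing what is exchanged.
    model = DDP(model, device_ids=[local_rank])
    # ... build optimizer, data loader, and training loop here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```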
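The full Birder update (Algorithm 1) is available only in the paper and the linked repository. As a rough orientation, the sketch below shows the generic pattern that 1-bit optimizers of this family build on: the update is compressed to its sign plus a per-tensor scale, and the compression error is fed back into the next step. The class name, the momentum form, and the scaling choice are illustrative assumptions, not the paper's algorithm.

```python
import torch

class OneBitMomentumSketch:
    """Illustrative 1-bit (sign) compressed momentum step with error feedback.

    A generic sketch only, NOT the Birder update rule from Algorithm 1.
    """

    def __init__(self, params, lr=0.001, beta=0.95, weight_decay=1e-4):
        self.params = list(params)
        self.lr, self.beta, self.wd = lr, beta, weight_decay
        self.momentum = [torch.zeros_like(p) for p in self.params]
        self.error = [torch.zeros_like(p) for p in self.params]   # error-feedback buffers

    @torch.no_grad()
    def step(self):
        for p, m, e in zip(self.params, self.momentum, self.error):
            if p.grad is None:
                continue
            g = p.grad + self.wd * p
            m.mul_(self.beta).add_(g, alpha=1 - self.beta)        # momentum update
            u = m + e                                             # add residual compression error
            scale = u.abs().mean()                                # per-tensor scale
            compressed = scale * torch.sign(u)                    # 1-bit signs + one scale
            e.copy_(u - compressed)                               # keep the new compression error
            # In a distributed run, only `compressed` (sign bits plus the scale)
            # would be communicated instead of full-precision tensors.
            p.add_(compressed, alpha=-self.lr)
```

Exchanging only sign bits and a scalar per tensor, rather than full-precision gradients, is where the reported communication savings come from.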
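The ResNet-50 setup quoted in the "Experiment Setup" row maps directly onto standard PyTorch components. The sketch below encodes the SGDM baseline's learning-rate rule and step decay; the per-GPU and global batch sizes are assumed example values (for 1-bit Adam and Birder the quoted recipe instead uses a 0.001 × batch size / 256 starting rate and β = 0.95).

```python
import torch
import torchvision

# Sketch of the quoted SGDM baseline recipe for ResNet-50 on ILSVRC2012.
# The global batch size (32 per GPU x 8 GPUs) is an assumed example value.
model = torchvision.models.resnet50()
global_batch_size = 32 * 8
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1 * global_batch_size / 256,   # linear scaling rule from the setup
    momentum=0.9,
    weight_decay=1e-4,
)
# Learning rate divided by 10 after epochs 30, 60, and 90; training stops after 100 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

for epoch in range(100):
    # ... one epoch of training here ...
    scheduler.step()
```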