1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed

Authors: Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on up to 256 GPUs show that 1-bit Adam enables up to 3.3× higher throughput for BERT-Large pre-training and up to 2.9× higher throughput for SQuAD fine-tuning. In addition, we provide theoretical analysis for our proposed work.
Researcher Affiliation | Collaboration | (1) Microsoft; (2) Department of Computer Science, University of Rochester; (3) Department of Computer Science, ETH Zurich.
Pseudocode | Yes | Algorithm 1: 1-bit Adam. (A minimal sketch of the algorithm's two-stage idea follows the table.)
Open Source Code | Yes | The 1-bit Adam optimizer and the communication primitive backend have been open sourced in a deep learning optimization library called DeepSpeed. https://github.com/microsoft/DeepSpeed
Open Datasets | Yes | We use the same dataset as Devlin et al. (2019), which is a concatenation of Wikipedia and BooksCorpus with 2.5B and 800M words respectively. We use the GLUE fine-tuning benchmark (Wang et al., 2018) to evaluate the convergence of the BERT models trained by Adam and 1-bit Adam. In addition, we also evaluate the convergence and performance of 1-bit Adam for the SQuAD 1.1 fine-tuning task (https://rajpurkar.github.io/SQuAD-explorer/) using a pre-trained BERT model checkpoint from Hugging Face. ... We train CIFAR10 using ResNet-18 (He et al., 2016). ... ImageNet (Russakovsky et al., 2015) ... CelebFaces Attributes Dataset (CelebA) (Liu et al., 2015)
Dataset Splits | Yes | For GLUE benchmarks we use the original Adam optimizer and perform single-task training on the dev set.
Hardware Specification | Yes | The first cluster has 4 NVIDIA Tesla V100 GPUs per node ... the second cluster has 8 V100 GPUs per node ... We run the experiments on 8 1080Ti GPUs where each GPU is used as one worker.
Software Dependencies | No | NVIDIA NCCL is an efficient and widely used communication library that has been tightly integrated in DL frameworks like PyTorch and TensorFlow. ... We design a custom collective primitive using Message Passing Interface (MPI). ... The CUDA-Aware version works only on systems with InfiniBand whereas the basic version can run on any system with Ethernet interconnect. ... MVAPICH2-GDR. (A rough sketch of such a compressed collective follows the table.)
Experiment Setup | Yes | For BERT pre-training, the learning rate linearly increases to 4×10⁻⁴ as a warmup in the first 12.5K steps, then decays to 0.99 of the original after every 520 steps. We set the two parameters in Algorithm 1 as β1 = 0.9 and β2 = 0.999 for 1-bit Adam and Adam. For the convergence test, we set the total batch size to 4K for BERT-Base and BERT-Large. For SQuAD fine-tuning we use the same parameters as published by Hugging Face (batch size = 24, learning rate = 3e-5, dropout = 0.1, 2 epochs), except that we increase the batch size to 96 (using 32 GPUs). (The stated learning-rate schedule is sketched in code after the table.)
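
The Pseudocode row points to Algorithm 1 (1-bit Adam). As a rough illustration of the two-stage idea the paper describes, here is a minimal single-process sketch, assuming NumPy: the warmup stage runs vanilla Adam, while the compression stage freezes the variance term and applies error-compensated 1-bit quantization to the momentum (in distributed training the quantized momentum is what gets communicated). All names and the freeze_step default are illustrative, not taken from the paper's code.

```python
import numpy as np

def onebit_adam_step(param, grad, m, v, error, step, *,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                     freeze_step=16_000):
    """One step of a 1-bit-Adam-style update (illustrative sketch only)."""
    m[:] = beta1 * m + (1 - beta1) * grad
    if step < freeze_step:
        # Warmup stage: plain Adam, the variance term keeps being updated.
        v[:] = beta2 * v + (1 - beta2) * grad ** 2
    else:
        # Compression stage: v is frozen; quantize the momentum to 1 bit
        # (sign plus a single scale) with an error-feedback buffer. In the
        # distributed setting this compressed tensor is what workers exchange.
        compensated = m + error
        scale = np.abs(compensated).mean()
        quantized = scale * np.sign(compensated)
        error[:] = compensated - quantized   # remember what compression lost
        m[:] = quantized
    param -= lr * m / (np.sqrt(v) + eps)
```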
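The Software Dependencies row mentions a custom collective primitive built on MPI for exchanging the compressed momentum. Below is a loose sketch, assuming mpi4py and NumPy, of an error-compensated 1-bit allreduce; it uses a plain allgather for clarity, whereas the paper's backend (including its CUDA-Aware variant) uses a more scalable scheme. The function name and buffer layout are ours, not DeepSpeed's.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def onebit_allreduce(tensor, error):
    """Average `tensor` across ranks using sign + scale compression
    with a per-rank error-compensation buffer (illustrative sketch)."""
    world = comm.Get_size()
    compensated = tensor + error
    scale = np.abs(compensated).mean()              # one scalar per message
    signs = np.sign(compensated).astype(np.int8)    # 1 bit per element (stored as int8 here)
    error[:] = compensated - scale * signs          # error feedback for the next call

    # Exchange the compressed payloads and reconstruct the average.
    all_signs = np.empty((world, signs.size), dtype=np.int8)
    all_scales = np.empty(world, dtype=np.float64)
    comm.Allgather(signs, all_signs)
    comm.Allgather(np.array([scale], dtype=np.float64), all_scales)
    return (all_scales[:, None] * all_signs).mean(axis=0)
```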
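Finally, the learning-rate schedule quoted in the Experiment Setup row can be written down directly. One natural reading is a linear warmup to 4×10⁻⁴ over the first 12.5K steps followed by a compounding ×0.99 decay every 520 steps; the defaults below simply restate those numbers and may differ from the released training scripts.

```python
def bert_pretrain_lr(step, peak_lr=4e-4, warmup_steps=12_500,
                     decay_rate=0.99, decay_interval=520):
    """Learning rate at a given optimizer step for the schedule quoted above."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps              # linear warmup
    decay_periods = (step - warmup_steps) // decay_interval
    return peak_lr * decay_rate ** decay_periods          # multiply by 0.99 every 520 steps
```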