1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed
Authors: Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on up to 256 GPUs show that 1-bit Adam enables up to 3.3× higher throughput for BERT-Large pre-training and up to 2.9× higher throughput for SQuAD fine-tuning. In addition, we provide theoretical analysis for our proposed work. |
| Researcher Affiliation | Collaboration | Microsoft; Department of Computer Science, University of Rochester; Department of Computer Science, ETH Zurich. |
| Pseudocode | Yes (see the algorithm sketch after the table) | Algorithm 1: 1-bit Adam |
| Open Source Code | Yes | The 1-bit Adam optimizer and the communication primitive backend have been open sourced in a deep learning optimization library called DeepSpeed (https://github.com/microsoft/DeepSpeed). |
| Open Datasets | Yes | We use the same dataset as Devlin et al. (2019), which is a concatenation of Wikipedia and BooksCorpus with 2.5B and 800M words respectively. We use the GLUE fine-tuning benchmark (Wang et al., 2018) to evaluate the convergence of the BERT models trained by Adam and 1-bit Adam. In addition, we also evaluate the convergence and performance of 1-bit Adam for the SQuAD 1.1 fine-tuning task (https://rajpurkar.github.io/SQuAD-explorer/) using a pre-trained BERT model checkpoint from Hugging Face. ... We train CIFAR10 using ResNet-18 (He et al., 2016). ... ImageNet (Russakovsky et al., 2015) ... CelebFaces Attributes Dataset (CelebA) (Liu et al., 2015) |
| Dataset Splits | Yes | For GLUE benchmarks we use the original Adam optimizer and perform single-task training on the dev set. |
| Hardware Specification | Yes | the first cluster has 4 NVIDIA Tesla V100 GPUs per node... the second cluster has 8 V100 GPUs per node... We run the experiments on 8 1080Ti GPUs where each GPU is used as one worker. |
| Software Dependencies | No | NVIDIA NCCL is an efficient and widely used communication library that has been tightly integrated in DL frameworks like PyTorch and TensorFlow. ... We design a custom collective primitive using Message Passing Interface (MPI). ... The CUDA-Aware version works only on systems with InfiniBand whereas the basic version can run on any system with Ethernet interconnect. ... MVAPICH2-GDR |
| Experiment Setup | Yes (see the schedule sketch after the table) | For BERT pre-training, the learning rate linearly increases to 4×10⁻⁴ as a warmup in the first 12.5K steps, then decays into 0.99 of the original after every 520 steps. We set the two parameters in Algorithm 1 as β1 = 0.9 and β2 = 0.999 for 1-bit Adam and Adam. For the convergence test, we set the total batch size as 4K for BERT-Base and BERT-Large. For SQuAD fine-tuning we use the same parameters as published by Hugging Face (batch size = 24, learning rate = 3e-5, dropout = 0.1, 2 epochs), except that we increase the batch size to 96 (using 32 GPUs). |
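
The Pseudocode row above cites Algorithm 1 (1-bit Adam). Below is a minimal single-worker sketch of the two-stage idea, written against NumPy rather than the DeepSpeed implementation: a warmup phase runs vanilla Adam and then freezes the variance term, after which the error-compensated, 1-bit-compressed momentum drives the update. The function names, the collapsed single compression step (the paper compresses on both the worker and the server side), and the scaled-sign compressor are illustrative assumptions, not the authors' exact code.

```python
import numpy as np

def onebit_compress(x):
    # Scaled-sign compressor (an illustrative choice): keep only the sign,
    # rescale by the mean absolute magnitude so the update keeps its scale.
    return np.mean(np.abs(x)) * np.sign(x)

def warmup_step(param, grad, m, v, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # Warmup stage: plain Adam; the variance term v is still being updated.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    param = param - lr * m / (np.sqrt(v) + eps)
    return param, m, v

def compression_step(param, grad, m, v_frozen, error, lr, beta1=0.9, eps=1e-8):
    # Compression stage: v_frozen is no longer updated; the momentum is
    # 1-bit compressed with error feedback before it drives the update.
    # (In the distributed algorithm this compressed tensor is what gets
    # communicated; the worker/server round trip is omitted here.)
    m = beta1 * m + (1 - beta1) * grad
    compressed = onebit_compress(m + error)
    error = (m + error) - compressed          # error feedback for the next step
    param = param - lr * compressed / (np.sqrt(v_frozen) + eps)
    return param, m, error
```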
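
The Experiment Setup row quotes a warmup-then-step-decay learning-rate schedule for BERT pre-training: linear warmup to 4×10⁻⁴ over the first 12.5K steps, then a 0.99 multiplicative decay every 520 steps. A small sketch of that schedule follows; the function name is made up, and whether the 520-step decay counter starts at step 0 or at the end of warmup is an assumption.

```python
def bert_pretrain_lr(step, peak_lr=4e-4, warmup_steps=12_500,
                     decay_interval=520, decay_rate=0.99):
    # Linear warmup to peak_lr, then multiply by decay_rate every
    # decay_interval steps (counted here from the end of warmup).
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * decay_rate ** ((step - warmup_steps) // decay_interval)
```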