Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam

Authors: Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On various large-scale benchmarks such as BERT-Base, BERT-Large, GPT-2 pre-training and ImageNet, we demonstrate on up to 128 GPUs that 0/1 Adam is able to reduce up to 87% of data volume, 54% of communication rounds, and achieve up to 2× higher training throughput and end-to-end training time reduction compared to the state-of-the-art baseline 1-bit Adam, while enjoying the same statistical convergence speed and end-task model accuracy on the GLUE dataset and the ImageNet validation set.
Researcher Affiliation | Collaboration | Yucheng Lu (Cornell University), Conglong Li (Microsoft), Minjia Zhang (Microsoft), Christopher De Sa (Cornell University), Yuxiong He (Microsoft)
Pseudocode | Yes | Algorithm 1: Proposed 0/1 Adam Algorithm
Open Source Code | Yes | The 0/1 Adam optimizer and corresponding experimental scripts (e.g., BERT pre-training and GLUE fine-tuning) have been open-sourced in a deep learning optimization library called DeepSpeed: https://github.com/microsoft/DeepSpeed (a hedged configuration sketch follows the table).
Open Datasets | Yes | For BERT model, we use the same dataset as (Devlin et al., 2018), which is a concatenation of Wikipedia and BooksCorpus with 2.5B and 800M words respectively. ... For ImageNet, we adopt ImageNet-1k dataset, which contains 1.28M images for training and 50K images for validation (Deng et al., 2009). ... For training data, we adopt the same dataset blend as in (Shoeybi et al., 2019): Wikipedia (Devlin et al., 2018), CC-Stories (Trinh and Le, 2018), RealNews (Zellers et al., 2019), and OpenWebText (Radford et al., 2019b).
Dataset Splits | Yes | For ImageNet, we adopt ImageNet-1k dataset, which contains 1.28M images for training and 50K images for validation (Deng et al., 2009). ... We use the GLUE fine-tuning benchmark (Wang et al., 2018b) to evaluate the convergence of the BERT models trained by different algorithms. ... Table 1: GLUE development set results.
Hardware Specification | Yes | We evaluate two clusters: one with 4 NVIDIA V100 GPUs per node and 40 Gigabit Ethernet inter-node network (2.7 Gbps effective bandwidth); the other one with 8 V100 GPUs per node and 100 Gigabit InfiniBand EDR inter-node network (close to theoretical peak effective bandwidth).
Software Dependencies | No | The paper mentions using 'DeepSpeed' and 'PyTorch' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For BERT pretraining, we follow the settings from (Devlin et al., 2018) and let the learning rate linearly increase to 4 × 10⁻⁴ as a warmup in the first 12.5K steps, then decay by a factor of 0.99 every 520 steps. We set β1 = 0.9 and β2 = 0.999 for all the algorithms. We adopt a batch size of 4096. ... For ImageNet, we follow the example script from PyTorch and use a batch size of 256 and a milestone decay learning rate schedule: starting at 1e-4 and decaying by a factor of 10 at epochs 30 and 60, with 90 epochs in total. (See the learning-rate schedule sketch after this table.)
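
As noted in the Open Source Code row, 0/1 Adam ships with DeepSpeed. Below is a minimal configuration sketch assuming DeepSpeed's documented "ZeroOneAdam" optimizer type; the parameter names follow the public DeepSpeed documentation, and the numeric values other than the learning rate, betas, and batch size quoted above are illustrative placeholders rather than settings taken from the paper's scripts.

```python
# Hedged sketch: enabling DeepSpeed's 0/1 Adam ("ZeroOneAdam") optimizer.
# Parameter names follow the public DeepSpeed documentation; values marked
# "illustrative" are placeholders, not settings reported in the paper.
import torch
import deepspeed

model = torch.nn.Linear(768, 768)  # stand-in for a real BERT/GPT-2 model

ds_config = {
    "train_batch_size": 4096,        # batch size quoted in the Experiment Setup row
    "optimizer": {
        "type": "ZeroOneAdam",
        "params": {
            "lr": 4e-4,              # peak LR quoted for BERT pre-training
            "betas": [0.9, 0.999],   # beta1/beta2 reported in the paper
            "var_freeze_step": 23000,  # illustrative: step at which the variance state is frozen
            "cuda_aware": False,
            "comm_backend_name": "nccl",
        },
    },
    "fp16": {"enabled": True},
}

# Assumes the process group has been launched (e.g. via the deepspeed launcher).
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```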
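
The Experiment Setup row describes two learning-rate schedules. The sketch below is one way to reproduce them with stock PyTorch schedulers, under the interpretation that the BERT schedule multiplies the learning rate by 0.99 every 520 post-warmup steps and that ImageNet uses milestone decay at epochs 30 and 60; it is an illustration of the quoted description, not the authors' released training script.

```python
# Hedged sketch of the two learning-rate schedules quoted in the Experiment Setup row.
import torch

model = torch.nn.Linear(10, 10)  # placeholder model

# BERT pre-training: linear warmup to 4e-4 over the first 12.5K steps,
# then multiply the learning rate by 0.99 every 520 steps.
bert_opt = torch.optim.Adam(model.parameters(), lr=4e-4, betas=(0.9, 0.999))

def bert_lr_lambda(step: int) -> float:
    warmup_steps = 12_500
    if step < warmup_steps:
        return step / warmup_steps
    return 0.99 ** ((step - warmup_steps) // 520)

# Call bert_sched.step() once per optimizer step.
bert_sched = torch.optim.lr_scheduler.LambdaLR(bert_opt, lr_lambda=bert_lr_lambda)

# ImageNet: LR starting at 1e-4, decayed by a factor of 10 at epochs 30 and 60,
# 90 epochs in total (MultiStepLR stepped once per epoch).
imagenet_opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
imagenet_sched = torch.optim.lr_scheduler.MultiStepLR(
    imagenet_opt, milestones=[30, 60], gamma=0.1
)
```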