ZeRO++: Extremely Efficient Collective Communication for Large Model Training

Authors: Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Xiaoxia Wu, Connor Holmes, Zhewei Yao, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section evaluates ZeRO++ in three areas. First, it shows end-to-end throughput scalability and speedup over the baseline for standard and RLHF training across different models, model sizes, hardware configurations, and cluster settings, demonstrating consistent speedups (up to 3.3x) across the board. Second, it shows the convergence properties of ZeRO++ for both pre-training and fine-tuning, demonstrating its robustness and tolerance to extreme quantization all the way down to 2 bits. Third, it presents ablation studies demonstrating the impact of each component of ZeRO++ and the effectiveness of our kernel optimizations.
Researcher Affiliation | Collaboration | Microsoft DeepSpeed, OpenAI, Snowflake, University of Houston, University of Nevada, Reno
Pseudocode | Yes | Algorithm 1: ZeRO algorithm (a conceptual sketch of the partitioned-parameter flow it describes appears after this table).
Open Source Code | Yes | Please refer to the appendix and our open-sourced evaluation scripts for hyperparameters and other training details.
Open Datasets | Yes | We analyze the pretraining of GPT-350M and GPT-13B models on the Pile dataset (Biderman et al., 2022), employing ZeRO++ with non-blocked quantization, ZeRO++ (with blocked quantization), and ZeRO-3 as the baseline (a sketch of block-wise quantization appears after this table).
Dataset Splits | No | The paper mentions using 'validation LM loss' and 'validation perplexity' for evaluation, but does not provide specific details on how the dataset was split into training/validation/test sets, such as percentages, absolute counts, or references to predefined splits.
Hardware Specification | Yes | Hardware: 24 NVIDIA DGX-2 nodes, each with 16 V100 SXM3 32 GB GPUs. The nodes are connected by InfiniBand (IB) with NVIDIA SHARP support, achieving a total inter-node bandwidth of over 800 Gbps.
Software Dependencies | No | The paper mentions 'PyTorch quantization' and custom CUDA kernels, but does not specify version numbers for PyTorch, Python, CUDA, or any other software dependencies used in the experiments.
Experiment Setup | No | The paper mentions using '2K tokens per GPU' and '1K tokens per GPU' as micro batch sizes and states that 'all hyperparameters remain consistent', but does not explicitly list the specific values of these hyperparameters (e.g., learning rate, optimizer, number of epochs) in the main text (a small helper translating tokens-per-GPU into samples per GPU appears after this table).
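
For readers unfamiliar with the partitioned-parameter flow that Algorithm 1 and the ablations refer to, here is a minimal, self-contained sketch in plain NumPy. It is not the paper's pseudocode or the DeepSpeed implementation; the names (partition, all_gather, forward, WORLD_SIZE, HIDDEN) are illustrative only. The point it shows: under ZeRO-3, parameters live only as per-rank shards, each layer's full weight is reassembled by an all-gather just before its compute and freed right after, and ZeRO++ targets the cost of exactly these gathers by quantizing the communicated shards.

```python
# Conceptual sketch of ZeRO-3 style partitioned parameters (simulated, single process).
import numpy as np

WORLD_SIZE = 4          # number of simulated ranks (illustrative)
HIDDEN = 8              # toy layer width (illustrative)

def partition(full_param, world_size):
    """Split a flat parameter into equal shards, one per rank."""
    return np.split(full_param, world_size)

def all_gather(shards):
    """Simulated all-gather: reconstruct the full parameter from all shards."""
    return np.concatenate(shards)

# Each layer's weights are stored only as per-rank shards (the ZeRO-3 state).
layers = [partition(np.random.randn(HIDDEN * HIDDEN), WORLD_SIZE) for _ in range(3)]

def forward(x):
    for shards in layers:
        w = all_gather(shards).reshape(HIDDEN, HIDDEN)  # gather the full weight just-in-time
        x = np.tanh(x @ w)                              # compute with it
        del w                                           # drop the full copy; only shards persist
    return x

print(forward(np.random.randn(2, HIDDEN)).shape)  # (2, 8)
```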
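
The "blocked quantization" contrasted with "non-blocked quantization" in the Open Datasets row refers to quantizing a tensor in fixed-size blocks, each with its own scale, so that a single outlier cannot inflate the quantization error of the entire tensor. Below is a hedged NumPy sketch of symmetric block-wise int8 quantization; the function names and block size are illustrative assumptions, and this is not the ZeRO++ CUDA kernel.

```python
# Block-wise symmetric quantization sketch: one scale per block instead of one per tensor.
import numpy as np

def quantize_blocked(x, block_size=256, bits=8):
    qmax = 2 ** (bits - 1) - 1
    pad = (-x.size) % block_size                       # pad so the tensor divides into blocks
    blocks = np.pad(x.ravel(), (0, pad)).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                          # avoid division by zero for all-zero blocks
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales, x.shape, pad

def dequantize_blocked(q, scales, shape, pad):
    x = (q.astype(np.float32) * scales).ravel()
    return (x[: x.size - pad] if pad else x).reshape(shape)

x = np.random.randn(1000).astype(np.float32)
q, s, shp, pad = quantize_blocked(x)
print("max abs error:", np.abs(dequantize_blocked(q, s, shp, pad) - x).max())
```

Per-block scales keep the quantization range tight for each block, which is why the paper reports better convergence with blocked quantization than with a single whole-tensor scale.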
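
As a small aid to interpreting the micro batch sizes quoted in the Experiment Setup row, the helper below converts a tokens-per-GPU budget into samples per GPU and global totals. The sequence length in the example call is a hypothetical placeholder, not a value reported in the paper; the paper's actual settings are in its appendix and open-sourced scripts.

```python
# Translate a tokens-per-GPU micro batch budget into per-GPU and global counts.
def batch_breakdown(tokens_per_gpu, seq_len, num_gpus):
    samples_per_gpu = tokens_per_gpu // seq_len
    return {
        "samples_per_gpu": samples_per_gpu,
        "global_tokens": tokens_per_gpu * num_gpus,
        "global_samples": samples_per_gpu * num_gpus,
    }

# Example: 2K tokens per GPU on 24 nodes x 16 V100s, assuming a hypothetical seq_len of 1024.
print(batch_breakdown(tokens_per_gpu=2048, seq_len=1024, num_gpus=24 * 16))
```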