ZeRO++: Extremely Efficient Collective Communication for Large Model Training
Authors: Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Xiaoxia Wu, Connor Holmes, Zhewei Yao, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section evaluates ZeRO++ in three areas. First, it shows end-to-end throughput scalability and speedup over baseline for standard and RLHF training across different models, model sizes, hardware configurations and cluster settings, demonstrating consistent speedup (up to 3.3x) across the board. Second, it shows convergence properties of ZeRO++ for both pre-training and finetuning, demonstrating its robustness and tolerance to extreme quantization all the way down to 2-bits. Third, it shows ablation studies demonstrating the impact of each component of ZeRO++ and the effectiveness of our kernel optimizations. |
| Researcher Affiliation | Collaboration | Microsoft DeepSpeed, OpenAI, Snowflake, University of Houston, University of Nevada, Reno |
| Pseudocode | Yes | Algorithm 1: ZeRO algorithm |
| Open Source Code | Yes | Please refer to the appendix and our open-sourced evaluation scripts for hyperparameters and other training details. |
| Open Datasets | Yes | We analyze the pretraining of GPT-350M and GPT-13B models on the Pile dataset (Biderman et al., 2022), employing ZeRO++ with non-blocked quantization, ZeRO++ with blocked quantization, and ZeRO-3 as the baseline. |
| Dataset Splits | No | The paper mentions using 'validation LM loss' and 'validation perplexity' for evaluation, but does not provide specific details on how the dataset was split into training/validation/test sets, such as percentages, absolute counts, or references to predefined splits. |
| Hardware Specification | Yes | Hardware: 24 NVIDIA DGX-2 nodes, each with 16 V100 SXM3 32 GB GPUs. The nodes are connected by InfiniBand (IB) with NVIDIA SHARP support, achieving total inter-node bandwidth of over 800 Gbps. |
| Software Dependencies | No | The paper mentions 'PyTorch quantization' and custom CUDA kernels, but does not specify version numbers for PyTorch, Python, CUDA, or any other software dependencies used in the experiments. |
| Experiment Setup | No | The paper mentions using '2K tokens per GPU' and '1K tokens per GPU' as micro batch sizes and states that 'all hyperparameters remain consistent' but does not explicitly list the specific values of these hyperparameters (e.g., learning rate, optimizer, number of epochs) in the main text. |
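
The table above repeatedly refers to blocked (group-wise) quantization, the core idea that lets ZeRO++ tolerate low bit-widths. The sketch below is a minimal PyTorch illustration of that general technique, not the paper's fused CUDA kernels; the `block_size` of 256, the symmetric scheme, and the function names are assumptions made for this example.

```python
# Minimal sketch of blocked (group-wise) symmetric quantization.
# Illustration only; ZeRO++ uses optimized fused CUDA kernels instead.
import torch


def blocked_quantize(x: torch.Tensor, block_size: int = 256, bits: int = 4):
    """Quantize a tensor in independent blocks; return int codes, per-block scales, and metadata."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for symmetric 4-bit
    flat = x.flatten().float()
    pad = (-flat.numel()) % block_size               # pad so the tensor splits evenly into blocks
    flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
    codes = torch.clamp(torch.round(blocks / scales), -qmax - 1, qmax).to(torch.int8)
    return codes, scales, x.shape, pad


def blocked_dequantize(codes, scales, shape, pad):
    """Invert blocked_quantize back to a float tensor of the original shape."""
    flat = (codes.float() * scales).flatten()
    if pad:
        flat = flat[:-pad]
    return flat.view(shape)


# Usage: per-block scales keep quantization error local to each block,
# which is why blocked quantization degrades less at low bit-widths than
# a single global scale over the whole tensor.
grad = torch.randn(1024, 512)
codes, scales, shape, pad = blocked_quantize(grad, block_size=256, bits=4)
approx = blocked_dequantize(codes, scales, shape, pad)
print("mean abs error:", (grad - approx).abs().mean().item())
```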