ZeRO++: Extremely Efficient Collective Communication for Large Model Training
Authors: Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Xiaoxia Wu, Connor Holmes, Zhewei Yao, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section evaluates ZeRO++ in three areas. First, it shows end-to-end throughput scalability and speedup over baseline for standard and RLHF training across different models, model sizes, hardware configurations and cluster settings, demonstrating consistent speedup (up to 3.3x) across the board. Second, it shows convergence properties of ZeRO++ for both pre-training and finetuning, demonstrating its robustness and tolerance to extreme quantization all the way down to 2-bits. Third, it shows ablation studies demonstrating the impact of each component of ZeRO++ and the effectiveness of our kernel optimizations. |
| Researcher Affiliation | Collaboration | Microsoft DeepSpeed, OpenAI, Snowflake, University of Houston, University of Nevada, Reno |
| Pseudocode | Yes | Algorithm 1: ZeRO algorithm |
| Open Source Code | Yes | Please refer to the appendix and our open-sourced evaluation scripts for hyperparameters and other training details. |
| Open Datasets | Yes | We analyze the pretraining of GPT-350M and GPT-13B models on the Pile dataset (Biderman et al., 2022), employing ZeRO++ with non-blocked quantization, ZeRO++ with blocked quantization, and ZeRO-3 as the baseline. |
| Dataset Splits | No | The paper mentions using 'validation LM loss' and 'validation perplexity' for evaluation, but does not provide specific details on how the dataset was split into training/validation/test sets, such as percentages, absolute counts, or references to predefined splits. |
| Hardware Specification | Yes | Hardware: 24 NVIDIA DGX-2 nodes, each with 16 V100 SXM3 32 GB GPUs. The nodes are connected by InfiniBand (IB) with NVIDIA SHARP support, achieving total inter-node bandwidth of over 800 Gbps. |
| Software Dependencies | No | The paper mentions 'PyTorch quantization' and custom CUDA kernels, but does not specify version numbers for PyTorch, Python, CUDA, or any other software dependencies used in the experiments. |
| Experiment Setup | No | The paper mentions using '2K tokens per GPU' and '1K tokens per GPU' as micro batch sizes and states that 'all hyperparameters remain consistent' but does not explicitly list the specific values of these hyperparameters (e.g., learning rate, optimizer, number of epochs) in the main text. |
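
The table above repeatedly refers to blocked (group-wise) quantization, the core idea that lets ZeRO++ tolerate low bit-widths. The sketch below is a minimal PyTorch illustration of that general technique, not the paper's fused CUDA kernels; the `block_size` of 256, the symmetric scheme, and the function names are assumptions made for this example.

```python
# Minimal sketch of blocked (group-wise) symmetric quantization.
# Illustration only; ZeRO++ uses optimized fused CUDA kernels instead.
import torch


def blocked_quantize(x: torch.Tensor, block_size: int = 256, bits: int = 4):
    """Quantize a tensor in independent blocks; return int codes, per-block scales, and metadata."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for symmetric 4-bit
    flat = x.flatten().float()
    pad = (-flat.numel()) % block_size               # pad so the tensor splits evenly into blocks
    flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
    codes = torch.clamp(torch.round(blocks / scales), -qmax - 1, qmax).to(torch.int8)
    return codes, scales, x.shape, pad


def blocked_dequantize(codes, scales, shape, pad):
    """Invert blocked_quantize back to a float tensor of the original shape."""
    flat = (codes.float() * scales).flatten()
    if pad:
        flat = flat[:-pad]
    return flat.view(shape)


# Usage: per-block scales keep quantization error local to each block,
# which is why blocked quantization degrades less at low bit-widths than
# a single global scale over the whole tensor.
grad = torch.randn(1024, 512)
codes, scales, shape, pad = blocked_quantize(grad, block_size=256, bits=4)
approx = blocked_dequantize(codes, scales, shape, pad)
print("mean abs error:", (grad - approx).abs().mean().item())
```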