Decentralized Training of Foundation Models in Heterogeneous Environments

Authors: Binhang Yuan, Yongjun He, Jared Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy S. Liang, Christopher Ré, Ce Zhang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments that represent different scenarios for learning over geo-distributed devices simulated using real-world network measurements. In the most extreme case, across 8 different cities spanning 3 continents, our approach is 4.8× faster than prior state-of-the-art training systems.
Researcher Affiliation | Collaboration | ETH Zürich, Switzerland; Stanford University, USA; Carnegie Mellon University; Meta AI
Pseudocode | No | The paper describes algorithms (e.g., 'bi-level scheduling algorithm', 'evolutionary algorithm', 'local search strategy') but does not present them in structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at: https://github.com/DS3Lab/DT-FM.
Open Datasets | No | The paper mentions using a 'standard GPT3-1.3B architecture' and following 'standard settings for GPT3-1.3B training', but it does not provide concrete access information (link, citation with authors/year, or specific repository) for the dataset used for this training.
Dataset Splits | No | The paper mentions following 'standard settings for GPT3-1.3B training' but does not explicitly state specific dataset split percentages, sample counts, or provide citations to predefined splits for training, validation, and testing.
Hardware Specification | Yes | To simulate the decentralized setting, we use 8 different AWS regions (Oregon, Virginia, Ohio, Tokyo, Seoul, London, Frankfurt, and Ireland) and measure the latency and bandwidth between these regions (...) We use 64 Tesla V100 GPUs (...) Case 1. Data center on demand. (...) we use 8 AWS p3.16xlarge nodes (each with 8 V100 GPUs) (...) Case 2. Data center spot instances. (...) we rent 4 AWS p3.8xlarge nodes (each with 4 V100) and 32 p3.2xlarge nodes (each with 1 V100);
Software Dependencies | No | Our system was built on PyTorch and CuPy and was evaluated using publicly available datasets. All of them use open-source licenses and can be used for non-commercial educational purposes. The paper does not provide specific version numbers for PyTorch or CuPy.
Experiment Setup | Yes | We use the standard GPT3-1.3B architecture [2], while also benchmarking different numbers of layers {24, 32, 40} and batch sizes {1024, 2048, 4096}. Tuning of Megatron and Deepspeed: We did a careful grid search of different parallelism settings and report the optimal results in each case. In Case 1, the optimal setting includes tensor model parallelism in Megatron and ZeRO-S3 in Deepspeed; in all other cases, the optimal settings are based on pipeline and data parallelism.