Decentralized Training of Foundation Models in Heterogeneous Environments
Authors: Binhang Yuan, Yongjun He, Jared Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy S. Liang, Christopher Ré, Ce Zhang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments that represent different scenarios for learning over geo-distributed devices simulated using real-world network measurements. In the most extreme case, across 8 different cities spanning 3 continents, our approach is 4.8× faster than prior state-of-the-art training systems. |
| Researcher Affiliation | Collaboration | ETH Zürich, Switzerland; Stanford University, USA; Carnegie Mellon University; Meta AI |
| Pseudocode | No | The paper describes algorithms (e.g., a 'bi-level scheduling algorithm', an 'evolutionary algorithm', and a 'local search strategy') but does not present them in structured pseudocode or algorithm blocks (see the hedged scheduling sketch after this table). |
| Open Source Code | Yes | Our code is available at: https://github.com/DS3Lab/DT-FM. |
| Open Datasets | No | The paper mentions using a 'standard GPT3-1.3B architecture' and following 'standard settings for GPT3-1.3B training', but it does not provide concrete access information (link, citation with authors/year, or specific repository) for the dataset used for this training. |
| Dataset Splits | No | The paper mentions following 'standard settings for GPT3-1.3B training' but does not explicitly state specific dataset split percentages, sample counts, or provide citations to predefined splits for training, validation, and testing. |
| Hardware Specification | Yes | To simulate the decentralized setting, we use 8 different AWS regions (Oregon, Virginia, Ohio, Tokyo, Seoul, London, Frankfurt, and Ireland) and measure the latency and bandwidth between these regions (...) We use 64 Tesla V100 GPUs (...) Case 1. Data center on demand. (...) we use 8 AWS p3.16xlarge nodes (each with 8 V100 GPUs) (...) Case 2. Data center spot instances. (...) we rent 4 AWS p3.8xlarge nodes (each with 4 V100) and 32 p3.2xlarge nodes (each with 1 V100); |
| Software Dependencies | No | Our system was built on PyTorch and CuPy and was evaluated using publicly available datasets. All of them use open-source licenses and can be used for non-commercial educational purposes. The paper does not provide specific version numbers for PyTorch or CuPy. |
| Experiment Setup | Yes | We use the standard GPT3-1.3B architecture [2], while also benchmarking different numbers of layers {24, 32, 40} and batch sizes {1024, 2048, 4096}. Tuning of Megatron and DeepSpeed: we did a careful grid search over different parallelism settings and report the optimal results in each case. In Case 1, the optimal setting includes tensor model parallelism in Megatron and ZeRO-S3 in DeepSpeed; in all other cases, the optimal settings are based on pipeline and data parallelism (see the sketches following this table). |
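
Since the paper describes its bi-level scheduling only in prose (an evolutionary algorithm with a local-search move that assigns devices to pipeline stages under measured communication costs), the following is a minimal sketch of that idea, not the paper's actual algorithm. The cost matrix, population size, mutation scheme, and cost model are all illustrative assumptions.

```python
import random

NUM_DEVICES, NUM_STAGES = 8, 4
PER_STAGE = NUM_DEVICES // NUM_STAGES

random.seed(0)
# Hypothetical symmetric pairwise communication cost between devices,
# standing in for the measured inter-region latency/bandwidth.
cost = [[0.0] * NUM_DEVICES for _ in range(NUM_DEVICES)]
for i in range(NUM_DEVICES):
    for j in range(i):
        cost[i][j] = cost[j][i] = random.uniform(1.0, 10.0)

def total_cost(perm):
    """Inter-stage (pipeline) plus intra-stage (data-parallel) communication."""
    stages = [perm[s * PER_STAGE:(s + 1) * PER_STAGE] for s in range(NUM_STAGES)]
    c = 0.0
    for s in range(NUM_STAGES - 1):  # activations move between adjacent stages
        c += sum(cost[a][b] for a in stages[s] for b in stages[s + 1])
    for stage in stages:             # gradients are exchanged within a stage
        c += sum(cost[a][b] for a in stage for b in stage if a < b)
    return c

def mutate(perm):
    """Local-search move: swap the stage assignment of two devices."""
    child = perm[:]
    i, j = random.sample(range(NUM_DEVICES), 2)
    child[i], child[j] = child[j], child[i]
    return child

# Evolutionary loop: mutate the population, keep the cheapest assignments.
population = [random.sample(range(NUM_DEVICES), NUM_DEVICES) for _ in range(20)]
for _ in range(200):
    population += [mutate(p) for p in population]
    population = sorted(population, key=total_cost)[:20]

best = population[0]
print("best assignment:", best, "cost:", round(total_cost(best), 2))
```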
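
The "careful grid search of different parallelism settings" over 64 V100 GPUs can be pictured as enumerating factorizations of the GPU count into tensor, pipeline, and data parallel degrees. The candidate degrees below are assumptions for illustration; Megatron and DeepSpeed impose their own constraints (e.g., tensor parallelism typically stays within one node).

```python
TOTAL_GPUS = 64

# Enumerate (tensor, pipeline, data) degrees whose product covers all GPUs.
configs = [
    (tp, pp, dp)
    for tp in (1, 2, 4, 8)              # tensor model parallel degree
    for pp in (1, 2, 4, 8)              # pipeline parallel degree
    for dp in (1, 2, 4, 8, 16, 32, 64)  # data parallel degree
    if tp * pp * dp == TOTAL_GPUS
]
for tp, pp, dp in configs:
    print(f"tensor={tp:2d}  pipeline={pp:2d}  data={dp:2d}")
```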
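
To show the scale implied by the benchmarked layer counts {24, 32, 40}, here is the usual back-of-the-envelope parameter estimate for a decoder-only transformer. The hidden size 2048 and vocabulary size 50257 are assumptions taken from the standard GPT-3 1.3B configuration; the paper does not state whether they were held fixed across layer counts.

```python
def approx_params(num_layers, d_model=2048, vocab=50257):
    # ~12 * L * d^2 covers attention + MLP weights per layer;
    # vocab * d covers the token embedding matrix.
    return 12 * num_layers * d_model**2 + vocab * d_model

for layers in (24, 32, 40):
    print(f"{layers} layers -> ~{approx_params(layers) / 1e9:.2f}B parameters")
# 24 layers recovers roughly the 1.3B scale of GPT3-1.3B;
# 32 and 40 layers give ~1.7B and ~2.1B under these assumptions.
```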