CO2: Efficient Distributed Training with Full Communication-Computation Overlap
Authors: Weigao Sun, Zhen Qin, Weixuan Sun, Shidi Li, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our findings through an extensive set of practical experiments encompassing a wide range of tasks in the fields of computer vision and natural language processing. These experiments serve to demonstrate the capabilities of CO2 in terms of convergence, generalization, and scalability when deployed across configurations comprising up to 128 A100 GPUs. |
| Researcher Affiliation | Academia | Weigao Sun, Zhen Qin, Weixuan Sun, Shidi Li, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong OpenNLPLab, Shanghai AI Laboratory |
| Pseudocode | Yes | Algorithm 1: CO2 Algorithm (a minimal sketch follows this table) |
| Open Source Code | Yes | https://github.com/OpenNLPLab/CO2 |
| Open Datasets | Yes | ImageNet-1K dataset; ADE20K dataset; ShapeNet dataset (Chang et al., 2015); OpenWebText dataset (Radford et al., 2019); WikiText-103; Wikipedia and BookCorpus (Wettig et al., 2022) |
| Dataset Splits | Yes | ImageNet-1K dataset, which contains 1.28M training images and 50K validation images from 1,000 categories. ... The ADE20K dataset is composed of 150 distinct classes, distributed across 20210, 2000, and 3352 images for training, validation, and testing, respectively. |
| Hardware Specification | Yes | Our experimental setup comprises up to 16 DGX-A100 servers, with each server featuring 8 A100 GPUs. These GPUs are interconnected via NVSwitch, providing an inter-GPU bandwidth of 600 GB/s... Part of the experiments are conducted on a 3090 GPU cluster with a total of 10 servers. Each server is equipped with eight 3090 GPUs. |
| Software Dependencies | Yes | Experiments are implemented in PyTorch 1.13.0 with CUDA 11.7, cuDNN 8.0, and NCCL 2.14.3. Our algorithm is developed upon FairScale 0.4.13. |
| Experiment Setup | Yes | Hyperparameters for each approach were meticulously tuned for maximum performance. For CO2, we conducted an extensive hyperparameter search for τ over the grid {1, 3, 6, 12, 24, 48, 96, 192} to find the optimal balance between efficiency and performance. ... For the training of ResNet-50, we use a total mini-batch size of 8192 and train for 90 epochs with a cosine learning rate schedule. ... All experiments use a context length of 1024 and a total batch size of 480, training on eight A100 nodes with 64 GPUs. (The launch stub after this table shows one τ setting from this grid.) |
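As a reading aid for the Algorithm 1 row above, here is a minimal, hypothetical PyTorch sketch of the pattern the paper describes: τ local updates per round, a non-blocking all-reduce launched at each round boundary and overlapped with the next round's computation, and an outer momentum update applied one round late once that communication has finished. The function and parameter names (`co2_style_train`, `tau`, `outer_lr`, `beta`) are illustrative, and the sketch omits CO2's staleness gap penalty; it is not the released implementation (see the repository linked above for that).

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def co2_style_train(model, optimizer, loader, tau=12, outer_lr=1.0, beta=0.8):
    """Sketch: local steps with an overlapped, one-round-delayed sync."""
    world = dist.get_world_size()
    outer = [p.detach().clone() for p in model.parameters()]  # outer weights
    mom = [torch.zeros_like(t) for t in outer]                # outer momentum
    pending = None  # (all-reduce handles, snapshot) from the previous round

    for step, (x, y) in enumerate(loader, start=1):
        loss = F.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                      # local update

        if step % tau == 0:
            # Launch a non-blocking all-reduce of the current local weights.
            # It proceeds in the background while the next tau local steps
            # run: this is the communication-computation overlap.
            snap = [p.detach().clone() for p in model.parameters()]
            handles = [dist.all_reduce(t, async_op=True) for t in snap]

            if pending is not None:
                old_handles, old_snap = pending
                for h in old_handles:
                    h.wait()  # usually returns at once: comm already finished
                with torch.no_grad():
                    for p, theta, m, s in zip(model.parameters(),
                                              outer, mom, old_snap):
                        # Outer momentum step toward the (stale) average.
                        m.mul_(beta).add_(theta - s / world)
                        theta.add_(m, alpha=-outer_lr)
                        # Re-anchor the local model on the outer weights.
                        # CO2 proper compensates for the staleness gap here;
                        # this plain copy is a simplification.
                        p.copy_(theta)
            pending = (handles, snap)
```

The key design point is that the `wait()` happens a full round of τ local steps after the all-reduce is launched, so on the hardware described above the communication cost is hidden behind computation rather than serialized with it.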
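To run such a sketch under the paper's stated stack (PyTorch 1.13.0, NCCL 2.14.3, one process per GPU), the standard `torch.distributed` initialization applies. The stub below is hypothetical wiring, not part of the paper's code; `tau=12` is one point from the searched grid.

```python
import torch
import torch.distributed as dist

if __name__ == "__main__":
    # One process per GPU, NCCL backend (the paper's setup uses NCCL 2.14.3).
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    # model, optimizer, and loader are left to the caller; tau=12 is one
    # point from the paper's searched grid {1, 3, 6, 12, 24, 48, 96, 192}.
    # co2_style_train(model, optimizer, loader, tau=12)
```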