CO2: Efficient Distributed Training with Full Communication-Computation Overlap

Authors: Weigao Sun, Zhen Qin, Weixuan Sun, Shidi Li, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our findings through an extensive set of practical experiments encompassing a wide range of tasks in the fields of computer vision and natural language processing. These experiments serve to demonstrate the capabilities of CO2 in terms of convergence, generalization, and scalability when deployed across configurations comprising up to 128 A100 GPUs.
Researcher Affiliation | Academia | Weigao Sun, Zhen Qin, Weixuan Sun, Shidi Li, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong; OpenNLPLab, Shanghai AI Laboratory
Pseudocode | Yes | Algorithm 1: CO2 Algorithm (a rough sketch of the idea follows the table)
Open Source Code | Yes | https://github.com/OpenNLPLab/CO2
Open Datasets | Yes | ImageNet-1K dataset, ADE20K dataset, ShapeNet dataset (Chang et al., 2015), OpenWebText dataset (Radford et al., 2019), WikiText-103, Wikipedia and BookCorpus (Wettig et al., 2022)
Dataset Splits | Yes | ImageNet-1K dataset, which contains 1.28M training images and 50K validation images from 1,000 categories. ... The ADE20K dataset is composed of 150 distinct classes, distributed across 20210, 2000, and 3352 images for training, validation, and testing, respectively. (A loading sketch follows the table.)
Hardware Specification | Yes | Our experimental setup comprises up to 16 DGX-A100 servers, with each server featuring 8 A100 GPUs. These GPUs are interconnected via NVSwitch, providing an inter-GPU bandwidth of 600GBps... Part of the experiments are conducted on a 3090 GPU cluster with a total of 10 servers, each equipped with eight 3090 GPUs.
Software Dependencies | Yes | Experiments are implemented in PyTorch 1.13.0 with CUDA 11.7, cuDNN 8.0, and NCCL 2.14.3. Our algorithm is developed upon FairScale 0.4.13. (A version-check snippet follows the table.)
Experiment Setup | Yes | Hyperparameters for each approach were meticulously tuned for maximum performance. For CO2, we conducted an extensive hyperparameter search for τ over {1, 3, 6, 12, 24, 48, 96, 192} to find the optimal balance between efficiency and performance. ... For the training of ResNet-50, we use a total mini-batch size of 8192 and train for 90 epochs with a cosine learning rate schedule. ... All experiments use a context length of 1024 and a total batch size of 480, training on eight A100 nodes with 64 GPUs. (A summary config follows the table.)
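The Pseudocode row above refers to Algorithm 1 without reproducing it. As a rough orientation, below is a minimal PyTorch-style sketch of the idea the paper describes: τ local updates overlapped with a one-round-delayed asynchronous all-reduce plus an outer-momentum step. This is an illustrative reconstruction, not the paper's Algorithm 1; the function name, the outer_lr/outer_beta parameters, and the exact update rule are assumptions. The official implementation is at https://github.com/OpenNLPLab/CO2.

```python
# Sketch of CO2-style training: local steps overlapped with a delayed async all-reduce.
# Illustrative only; update rule and hyperparameter names are assumptions.
import torch
import torch.distributed as dist

def co2_like_training(model, optimizer, loss_fn, data_loader, tau, outer_lr=1.0, outer_beta=0.9):
    world_size = dist.get_world_size()
    snapshot = [p.detach().clone() for p in model.parameters()]   # buffer being all-reduced
    outer_m = [torch.zeros_like(p) for p in model.parameters()]   # outer momentum
    pending = None                                                # in-flight all-reduce handles

    for step, (x, y) in enumerate(data_loader):
        # Ordinary local step on this worker; no communication happens here.
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

        if (step + 1) % tau == 0:
            if pending is not None:
                # The previous all-reduce ran concurrently with the last tau local steps,
                # so waiting here is (nearly) free: communication overlaps computation.
                for handle in pending:
                    handle.wait()
                with torch.no_grad():
                    for p, s, m in zip(model.parameters(), snapshot, outer_m):
                        s.div_(world_size)                        # averaged (stale) snapshot
                        m.mul_(outer_beta).add_(p.detach() - s)   # assumed outer momentum
                        p.copy_(s + outer_lr * m)                 # assumed outer update
            # Snapshot current parameters and launch the next asynchronous all-reduce;
            # local training continues immediately without blocking on communication.
            with torch.no_grad():
                for s, p in zip(snapshot, model.parameters()):
                    s.copy_(p.detach())
            pending = [dist.all_reduce(s, op=dist.ReduceOp.SUM, async_op=True) for s in snapshot]
```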
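For the Dataset Splits row, a minimal sketch of how the ImageNet-1K train/validation splits could be loaded with torchvision; the root path and transforms are placeholders, not taken from the paper.

```python
# Load the ImageNet-1K splits referenced in the Dataset Splits row (path is a placeholder).
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.Resize(256),
                                transforms.CenterCrop(224),
                                transforms.ToTensor()])
train_set = datasets.ImageNet("/path/to/imagenet", split="train", transform=transform)  # ~1.28M images
val_set = datasets.ImageNet("/path/to/imagenet", split="val", transform=transform)      # 50K images
print(len(train_set), len(val_set), len(train_set.classes))  # expect 1,000 categories
```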
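For the Software Dependencies row, a small sanity-check snippet that prints the locally installed versions so they can be compared against the reported stack; the expected values in the comments are copied from the row above, and the check itself is only a suggestion.

```python
# Print installed versions to compare against the paper's reported software stack.
import torch
import fairscale

print("PyTorch:", torch.__version__)              # reported: 1.13.0
print("CUDA:", torch.version.cuda)                # reported: 11.7
print("cuDNN:", torch.backends.cudnn.version())   # reported: 8.0
print("NCCL:", torch.cuda.nccl.version())         # reported: 2.14.3
print("FairScale:", fairscale.__version__)        # reported: 0.4.13
```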
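For the Experiment Setup row, the reported hyperparameters collected into a plain Python dict for reference; the key names are my own, only the values come from the quoted text.

```python
# Reported experiment setup, restated as a config dict (key names are illustrative).
co2_setup = {
    "tau_search_space": [1, 3, 6, 12, 24, 48, 96, 192],  # local-step search grid for CO2
    "resnet50": {
        "total_batch_size": 8192,
        "epochs": 90,
        "lr_schedule": "cosine",
    },
    "language_modeling": {
        "context_length": 1024,
        "total_batch_size": 480,
        "hardware": "eight A100 nodes, 64 GPUs",
    },
}
```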