CO2: Efficient Distributed Training with Full Communication-Computation Overlap

Authors: Weigao Sun, Zhen Qin, Weixuan Sun, Shidi Li, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our findings through an extensive set of practical experiments encompassing a wide range of tasks in the fields of computer vision and natural language processing. These experiments serve to demonstrate the capabilities of CO2 in terms of convergence, generalization, and scalability when deployed across configurations comprising up to 128 A100 GPUs.
Researcher Affiliation | Academia | Weigao Sun, Zhen Qin, Weixuan Sun, Shidi Li, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong; OpenNLPLab, Shanghai AI Laboratory
Pseudocode | Yes | Algorithm 1: CO2 Algorithm (a rough sketch of the idea follows the table)
Open Source Code | Yes | https://github.com/OpenNLPLab/CO2
Open Datasets | Yes | ImageNet-1K dataset, ADE20K dataset, ShapeNet dataset (Chang et al., 2015), OpenWebText dataset (Radford et al., 2019), WikiText-103, Wikipedia and BookCorpus (Wettig et al., 2022)
Dataset Splits | Yes | ImageNet-1K dataset, which contains 1.28M training images and 50K validation images from 1,000 categories. ... The ADE20K dataset is composed of 150 distinct classes, distributed across 20210, 2000, and 3352 images for training, validation, and testing, respectively. (A loading sketch follows the table.)
Hardware Specification | Yes | Our experimental setup comprises up to 16 DGX-A100 servers, with each server featuring 8 A100 GPUs. These GPUs are interconnected via NVSwitch, providing an inter-GPU bandwidth of 600GBps... Part of the experiments are conducted on a 3090 GPU cluster with a total of 10 servers, each equipped with eight 3090 GPUs.
Software Dependencies | Yes | Experiments are implemented in PyTorch 1.13.0 with CUDA 11.7, cuDNN 8.0, and NCCL 2.14.3. Our algorithm is developed upon FairScale 0.4.13. (A version-check snippet follows the table.)
Experiment Setup | Yes | Hyperparameters for each approach were meticulously tuned for maximum performance. For CO2, we conducted an extensive hyperparameter search for τ over {1, 3, 6, 12, 24, 48, 96, 192} to find the optimal balance between efficiency and performance. ... For the training of ResNet-50, we use a total mini-batch size of 8192 and train for 90 epochs with a cosine learning rate schedule. ... All experiments use a context length of 1024 and a total batch size of 480, training on eight A100 nodes with 64 GPUs. (A summary config follows the table.)
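The Pseudocode row above refers to Algorithm 1 without reproducing it. As a rough orientation, below is a minimal PyTorch-style sketch of the idea the paper describes: τ local updates overlapped with a one-round-delayed asynchronous all-reduce plus an outer-momentum step. This is an illustrative reconstruction, not the paper's Algorithm 1; the function name, the outer_lr/outer_beta parameters, and the exact update rule are assumptions. The official implementation is at https://github.com/OpenNLPLab/CO2.

```python
# Sketch of CO2-style training: local steps overlapped with a delayed async all-reduce.
# Illustrative only; update rule and hyperparameter names are assumptions.
import torch
import torch.distributed as dist

def co2_like_training(model, optimizer, loss_fn, data_loader, tau, outer_lr=1.0, outer_beta=0.9):
    world_size = dist.get_world_size()
    snapshot = [p.detach().clone() for p in model.parameters()]   # buffer being all-reduced
    outer_m = [torch.zeros_like(p) for p in model.parameters()]   # outer momentum
    pending = None                                                # in-flight all-reduce handles

    for step, (x, y) in enumerate(data_loader):
        # Ordinary local step on this worker; no communication happens here.
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

        if (step + 1) % tau == 0:
            if pending is not None:
                # The previous all-reduce ran concurrently with the last tau local steps,
                # so waiting here is (nearly) free: communication overlaps computation.
                for handle in pending:
                    handle.wait()
                with torch.no_grad():
                    for p, s, m in zip(model.parameters(), snapshot, outer_m):
                        s.div_(world_size)                        # averaged (stale) snapshot
                        m.mul_(outer_beta).add_(p.detach() - s)   # assumed outer momentum
                        p.copy_(s + outer_lr * m)                 # assumed outer update
            # Snapshot current parameters and launch the next asynchronous all-reduce;
            # local training continues immediately without blocking on communication.
            with torch.no_grad():
                for s, p in zip(snapshot, model.parameters()):
                    s.copy_(p.detach())
            pending = [dist.all_reduce(s, op=dist.ReduceOp.SUM, async_op=True) for s in snapshot]
```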
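For the Dataset Splits row, a minimal sketch of how the ImageNet-1K train/validation splits could be loaded with torchvision; the root path and transforms are placeholders, not taken from the paper.

```python
# Load the ImageNet-1K splits referenced in the Dataset Splits row (path is a placeholder).
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.Resize(256),
                                transforms.CenterCrop(224),
                                transforms.ToTensor()])
train_set = datasets.ImageNet("/path/to/imagenet", split="train", transform=transform)  # ~1.28M images
val_set = datasets.ImageNet("/path/to/imagenet", split="val", transform=transform)      # 50K images
print(len(train_set), len(val_set), len(train_set.classes))  # expect 1,000 categories
```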
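For the Software Dependencies row, a small sanity-check snippet that prints the locally installed versions so they can be compared against the reported stack; the expected values in the comments are copied from the row above, and the check itself is only a suggestion.

```python
# Print installed versions to compare against the paper's reported software stack.
import torch
import fairscale

print("PyTorch:", torch.__version__)              # reported: 1.13.0
print("CUDA:", torch.version.cuda)                # reported: 11.7
print("cuDNN:", torch.backends.cudnn.version())   # reported: 8.0
print("NCCL:", torch.cuda.nccl.version())         # reported: 2.14.3
print("FairScale:", fairscale.__version__)        # reported: 0.4.13
```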
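For the Experiment Setup row, the reported hyperparameters collected into a plain Python dict for reference; the key names are my own, only the values come from the quoted text.

```python
# Reported experiment setup, restated as a config dict (key names are illustrative).
co2_setup = {
    "tau_search_space": [1, 3, 6, 12, 24, 48, 96, 192],  # local-step search grid for CO2
    "resnet50": {
        "total_batch_size": 8192,
        "epochs": 90,
        "lr_schedule": "cosine",
    },
    "language_modeling": {
        "context_length": 1024,
        "total_batch_size": 480,
        "hardware": "eight A100 nodes, 64 GPUs",
    },
}
```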