SAPipe: Staleness-Aware Pipeline for Data Parallel DNN Training

Authors: Yangrui Chen, Cong Xie, Meng Ma, Juncheng Gu, Yanghua Peng, Haibin Lin, Chuan Wu, Yibo Zhu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that SAPipe achieves up to 157% speedups over BytePS (non-stale), and outperforms PipeSGD in accuracy by up to 13.7%.
Researcher Affiliation | Collaboration | Yangrui Chen (The University of Hong Kong, yrchen@cs.hku.hk); Cong Xie (ByteDance, cong.xie@bytedance.com); Meng Ma (ByteDance, meng.ma@bytedance.com); Juncheng Gu (ByteDance, juncheng.gu@bytedance.com); Yanghua Peng (ByteDance, pengyanghua.yanghua@bytedance.com); Haibin Lin (ByteDance, haibin.lin@bytedance.com); Chuan Wu (The University of Hong Kong, cwu@cs.hku.hk); Yibo Zhu (ByteDance, zhuyibo@bytedance.com)
Pseudocode | Yes | Algorithm 1 Distributed Training / Staleness Training Pipeline (PipeSGD), Algorithm 2 Staleness-Aware Pipeline with Delay Compensation (SAPipe-DC), Algorithm 3 Staleness-Aware Pipeline with Weight Prediction (SAPipe-WP). (A sketch of the two compensation ideas appears after this table.)
Open Source Code | Yes | Code: https://github.com/ChenAris/sapipe.git
Open Datasets | Yes | We train CV models on two datasets: (i) CIFAR-10 [16] and (ii) ImageNet [17]. We fine-tune the pretrained GPT-2 model on (iii) the WikiText-2 language modeling dataset [20]. The Transformer model is trained on (iv) Multi30K [8] for the WMT16 English-to-German Multimodal Translation task.
Dataset Splits | Yes | We train CV models on two datasets: (i) CIFAR-10 [16] and (ii) ImageNet [17]. We fine-tune the pretrained GPT-2 model on (iii) the WikiText-2 language modeling dataset [20]. The Transformer model is trained on (iv) Multi30K [8] for the WMT16 English-to-German Multimodal Translation task.
Hardware Specification | Yes | We evaluate SAPipe on 8 physical machines, each equipped with 90 CPU cores, 320GB memory, 8 Tesla V100 GPUs with NVLinks, and 100Gbps bandwidth between any two machines.
Software Dependencies | No | The paper mentions the 'BytePS framework, compatible to both TensorFlow and PyTorch' and that 'All baselines and SAPipe are run on PyTorch computation framework.' However, it does not provide specific version numbers for these software components.
Experiment Setup | Yes | The batch sizes per GPU are 128 images, 128 images, 80 tokens, and 3200 tokens, respectively. We adopt the SGD optimizer with 0.9 Polyak's momentum [24] and 5e-5 weight decay when training the VGG16 and ResNet50 models, and the Adam [14] optimizer with (0.9, 0.98) betas for NLP models. The global learning rates for VGG16, ResNet50, and GPT-2 are 0.1, 0.1, and 5e-5, respectively... SAPipe uses Option 3 in Algorithm 3 as the default staleness compensation method, with λ empirically set as 0.2. (An illustrative optimizer configuration also appears after this table.)
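
The Pseudocode row names two staleness-compensation strategies, delay compensation (SAPipe-DC) and weight prediction (SAPipe-WP). The snippet below is a minimal sketch of these two ideas, not the paper's Algorithms 2-3: it assumes a DC-ASGD-style element-wise Taylor correction for delay compensation and a momentum-buffer lookahead for weight prediction, and the helper names `delay_compensate` and `predict_weights` are hypothetical.

```python
# Hedged sketch of the two compensation ideas; exact update rules are in the paper.
import torch

LAMBDA = 0.2   # the paper's empirical default for the compensation coefficient
LR = 0.1       # illustrative learning rate (VGG16 / ResNet50 use 0.1)


def delay_compensate(stale_grad, w_now, w_stale, lam=LAMBDA):
    """Correct a gradient computed on stale weights (SAPipe-DC flavour).

    Approximates the missing curvature term with the element-wise surrogate
    lam * g * g * (w_now - w_stale), as in DC-ASGD-style compensation.
    """
    return stale_grad + lam * stale_grad * stale_grad * (w_now - w_stale)


def predict_weights(w, momentum_buf, steps_ahead=1, lr=LR):
    """Extrapolate weights `steps_ahead` updates forward (SAPipe-WP flavour).

    Uses the SGD momentum buffer to estimate where the weights will be when
    the delayed gradient is eventually applied.
    """
    return w - lr * steps_ahead * momentum_buf


# Toy usage on a single tensor
w_now, w_stale = torch.randn(4), torch.randn(4)
g_stale = torch.randn(4)
buf = torch.zeros(4)
print(delay_compensate(g_stale, w_now, w_stale))
print(predict_weights(w_now, buf))
```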
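
For the Experiment Setup row, the following is an illustrative PyTorch optimizer configuration matching the quoted hyperparameters; the `torch.nn.Linear` models are placeholders standing in for VGG16/ResNet50 and GPT-2/Transformer, and this is not the SAPipe training script, which runs on top of BytePS.

```python
# Illustrative optimizer setup only; model objects are stand-ins.
import torch

cv_model = torch.nn.Linear(8, 8)    # placeholder for VGG16 / ResNet50
nlp_model = torch.nn.Linear(8, 8)   # placeholder for GPT-2 / Transformer

# CV models: SGD with 0.9 Polyak momentum, 5e-5 weight decay, global lr 0.1
cv_opt = torch.optim.SGD(cv_model.parameters(), lr=0.1,
                         momentum=0.9, weight_decay=5e-5)

# NLP models: Adam with betas (0.9, 0.98); GPT-2 fine-tuning uses lr 5e-5
nlp_opt = torch.optim.Adam(nlp_model.parameters(), lr=5e-5, betas=(0.9, 0.98))
```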