AutoSync: Learning to Synchronize for Data-Parallel Distributed Deep Learning

Authors: Hao Zhang, Yuan Li, Zhijie Deng, Xiaodan Liang, Lawrence Carin, Eric Xing

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate AutoSync on a broad set of models and clusters, and show that there exist ample strategies in the proposed space that outperform hand-optimized systems by a significant margin. AutoSync can effectively find strategies that reduce the training time by 1.2x-1.6x compared to hand-optimized ones on multiple, difficult-to-parallelize model architectures (e.g. NCF [13], BERT [7] and VGG16 [30]), within an acceptable budget. We conduct experiments on two clusters (D): (1) Cluster A is an in-house cluster with 11 nodes, each equipped with a TITAN X GPU and a 40GbE Ethernet switch; (2) Cluster B is based on AWS and consists of 4 g4dn.12xlarge nodes, each with 4 NVIDIA T4 GPUs and 50GbE full bandwidth.
Researcher Affiliation | Collaboration | ¹Petuum Inc., ²Carnegie Mellon University, ³Duke University, ⁴Tsinghua University
Pseudocode | No | The paper describes the proposed methods in narrative text and with mathematical formulas but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The data and code accompanying this paper are available at https://github.com/petuum/autodist.
Open Datasets | Yes | As an additional contribution, we collect a dataset containing nearly 10,000 data points of (model, resource, strategy) tuples and their corresponding runtime on real clusters. We share the dataset with the community to encourage extended studies. The data and code accompanying this paper are available at https://github.com/petuum/autodist. (An illustrative loading sketch for such records appears after the table.)
Dataset Splits | No | The paper refers to training models with 'standard settings suggested by MLPerf [22]' but does not explicitly provide the training, validation, and test dataset splits (as percentages or sample counts) for the deep learning models used in the experiments.
Hardware Specification | Yes | We conduct experiments on two clusters (D): (1) Cluster A is an in-house cluster with 11 nodes, each equipped with a TITAN X GPU and a 40GbE Ethernet switch; (2) Cluster B is based on AWS and consists of 4 g4dn.12xlarge nodes, each with 4 NVIDIA T4 GPUs and 50GbE full bandwidth. (These two configurations are summarized in the sketch after the table.)
Software Dependencies | No | The paper mentions 'TensorFlow 2.0' and an unspecified 'NCCL version', but only TensorFlow is given a concrete version number; versions for other key software components, including NCCL, are not listed.
Experiment Setup | No | The paper mentions conducting synchronous training 'with standard settings suggested by MLPerf [22]' and specifies '10 warm-up iterations, then another 40 iterations of training' for runtime measurement (see the timing sketch after the table). However, it does not explicitly list concrete hyperparameter values such as learning rate, batch size, or optimizer settings for the models.
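
The released dataset is described as nearly 10,000 (model, resource, strategy) tuples with their measured runtimes. The sketch below is illustrative only: the file name and column names are assumptions for this example and are not taken from the paper or the petuum/autodist repository.

```python
# Minimal sketch for inspecting (model, resource, strategy, runtime) records.
# File name and column names are hypothetical, chosen only for illustration.
import csv
from collections import defaultdict

def load_runtime_records(path="autosync_runtimes.csv"):
    """Read records of the assumed form (model, resource, strategy_id, runtime_s)."""
    records = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            records.append({
                "model": row["model"],            # e.g. "VGG16", "BERT", "NCF"
                "resource": row["resource"],      # e.g. "cluster_A_11x_TITANX"
                "strategy": row["strategy_id"],   # identifier of the synchronization strategy
                "runtime_s": float(row["runtime_s"]),
            })
    return records

if __name__ == "__main__":
    records = load_runtime_records()
    # Group by (model, resource) and report the fastest strategy observed for each.
    best = defaultdict(lambda: (None, float("inf")))
    for r in records:
        key = (r["model"], r["resource"])
        if r["runtime_s"] < best[key][1]:
            best[key] = (r["strategy"], r["runtime_s"])
    for (model, resource), (strategy, runtime) in best.items():
        print(f"{model} on {resource}: best strategy {strategy} at {runtime:.2f} s/iter")
```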
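
For reference, the two quoted cluster configurations can be written down as plain data. The dictionary below is only a summary of the quoted hardware; the field names are arbitrary and do not follow any autodist configuration schema.

```python
# Illustrative summary of the two clusters quoted above; field names are arbitrary.
CLUSTERS = {
    "cluster_A": {
        "provider": "in-house",
        "nodes": 11,
        "gpus_per_node": 1,
        "gpu": "NVIDIA TITAN X",
        "network": "40GbE Ethernet switch",
    },
    "cluster_B": {
        "provider": "AWS",
        "instance_type": "g4dn.12xlarge",
        "nodes": 4,
        "gpus_per_node": 4,
        "gpu": "NVIDIA T4",
        "network": "50GbE full bandwidth",
    },
}

if __name__ == "__main__":
    for name, spec in CLUSTERS.items():
        total_gpus = spec["nodes"] * spec["gpus_per_node"]
        print(f"{name}: {total_gpus} x {spec['gpu']} over {spec['network']}")
```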
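
The Experiment Setup row quotes a measurement protocol of 10 warm-up iterations followed by 40 timed iterations. A minimal timing sketch of that protocol is given below; `train_step` is a placeholder, since the actual training loop, optimizer, and batch size are not specified in the quoted text.

```python
# Sketch of the quoted protocol: 10 warm-up iterations, then time 40 iterations
# and report the mean per-iteration runtime. `train_step` stands in for whatever
# synchronous training function the benchmark actually runs.
import time

def measure_runtime(train_step, warmup_iters=10, timed_iters=40):
    for _ in range(warmup_iters):      # warm-up: excludes graph building, caching, etc.
        train_step()
    start = time.perf_counter()
    for _ in range(timed_iters):
        train_step()
    elapsed = time.perf_counter() - start
    return elapsed / timed_iters       # mean seconds per training iteration

if __name__ == "__main__":
    # Dummy workload standing in for one training iteration.
    def train_step():
        time.sleep(0.01)
    print(f"mean iteration time: {measure_runtime(train_step):.4f} s")
```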