AutoSync: Learning to Synchronize for Data-Parallel Distributed Deep Learning
Authors: Hao Zhang, Yuan Li, Zhijie Deng, Xiaodan Liang, Lawrence Carin, Eric Xing
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate AutoSync on a broad set of models and clusters, and show that there exist ample strategies in the proposed space that outperform hand-optimized systems by a significant margin. AutoSync can effectively find strategies that reduce training time by 1.2x-1.6x compared to hand-optimized ones on multiple, difficult-to-parallelize model architectures (e.g. NCF [13], BERT [7] and VGG16 [30]), within an acceptable budget. The two evaluation clusters are quoted in full under Hardware Specification below. |
| Researcher Affiliation | Collaboration | 1Petuum Inc., 2Carnegie Mellon University, 3Duke University, 4Tsinghua University |
| Pseudocode | No | The paper describes the proposed methods in narrative text and with mathematical formulas but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The data and code accompanying this paper are available at https://github.com/petuum/autodist. |
| Open Datasets | Yes | As an additional contribution, we collect a dataset containing nearly 10000 data points of (model, resource, strategy) tuples and their corresponding runtime on real clusters. We share the dataset with the community to encourage extended studies. The data and code accompanying this paper are available at https://github.com/petuum/autodist. |
| Dataset Splits | No | The paper refers to training models with 'standard settings suggested by MLPerf [22]' but does not explicitly provide the specific training, validation, and test dataset splits in terms of percentages or sample counts for the deep learning models used in the experiments. |
| Hardware Specification | Yes | We conduct experiments on two clusters (D): (1) Cluster A is an in-house cluster with 11 nodes, each equipped with a TITAN X GPU and a 40GbE Ethernet switch; (2) Cluster B is based on AWS and consists of 4x g4dn.12xlarge nodes, each with 4 NVIDIA T4 GPUs and 50GbE full bandwidth. (See the cluster summary sketched after this table.) |
| Software Dependencies | No | The paper mentions 'TensorFlow 2.0' and refers to the NCCL version, but provides a specific version number only for TensorFlow; it does not list version numbers for NCCL or other key software components. |
| Experiment Setup | No | The paper mentions conducting synchronous training 'with standard settings suggested by MLPerf [22]' and specifies '10 warm-up iterations, then another 40 iterations of training' for runtime measurement. However, it does not explicitly list concrete hyperparameter values such as learning rate, batch size, or optimizer settings for the models. (A sketch of this timing protocol follows the table.) |
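
The Hardware Specification row points to a cluster summary below. The sketch records the two reported clusters as a plain Python data structure, e.g. for sizing a comparable reproduction; the class and field names (`ClusterSpec`, `gpus_per_node`, and so on) are illustrative choices, not the resource-specification format used by the released AutoDist code.

```python
# Illustrative summary of the two evaluation clusters reported in the paper.
# Field names are hypothetical; consult https://github.com/petuum/autodist for
# the resource-specification format the released code actually expects.
from dataclasses import dataclass


@dataclass
class ClusterSpec:
    name: str
    num_nodes: int
    gpus_per_node: int
    gpu_model: str
    network: str


CLUSTER_A = ClusterSpec(
    name="Cluster A (in-house)",
    num_nodes=11,
    gpus_per_node=1,
    gpu_model="NVIDIA TITAN X",
    network="40GbE Ethernet switch",
)

CLUSTER_B = ClusterSpec(
    name="Cluster B (AWS, 4x g4dn.12xlarge)",
    num_nodes=4,
    gpus_per_node=4,
    gpu_model="NVIDIA T4",
    network="50GbE full bandwidth",
)

if __name__ == "__main__":
    for cluster in (CLUSTER_A, CLUSTER_B):
        total_gpus = cluster.num_nodes * cluster.gpus_per_node
        print(f"{cluster.name}: {total_gpus} GPUs ({cluster.gpu_model}), {cluster.network}")
```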
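
The Experiment Setup row quotes the paper's runtime-measurement protocol: 10 warm-up iterations followed by 40 timed training iterations. Below is a minimal, framework-agnostic sketch of that protocol, assuming a hypothetical `train_step` callable standing in for one synchronous training iteration; it is not taken from the released code.

```python
import time
from statistics import mean
from typing import Callable


def measure_runtime(train_step: Callable[[], None],
                    warmup_iters: int = 10,
                    timed_iters: int = 40) -> float:
    """Return the mean per-iteration wall-clock time in seconds.

    Mirrors the protocol quoted above: run `warmup_iters` untimed iterations
    (to amortize graph construction, allocator warm-up, etc.), then time the
    next `timed_iters` iterations.
    """
    for _ in range(warmup_iters):
        train_step()

    durations = []
    for _ in range(timed_iters):
        start = time.perf_counter()
        train_step()
        durations.append(time.perf_counter() - start)
    return mean(durations)


if __name__ == "__main__":
    # Hypothetical usage with a dummy step that sleeps for 5 ms.
    print(measure_runtime(lambda: time.sleep(0.005)))
```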