Decentralized SGD and Average-direction SAM are Asymptotically Equivalent
Authors: Tongtian Zhu, Fengxiang He, Kaixuan Chen, Mingli Song, Dacheng Tao
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments support our theory and the code is available at D-SGD and SAM. Our empirical results also fully support our theory (see Figure 1 and Figure 3). |
| Researcher Affiliation | Collaboration | (1) College of Computer Science and Technology, Zhejiang University; (2) JD Explore Academy, JD.com, Inc.; (3) Artificial Intelligence and its Applications Institute, School of Informatics, University of Edinburgh; (4) The University of Sydney. |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled "Pseudocode" or "Algorithm", nor does it include structured code-like blocks outlining a procedure. |
| Open Source Code | Yes | Experiments support our theory and the code is available at D-SGD and SAM. (Abstract) and Code is available at D-SGD and SAM. (Section 5) |
| Open Datasets | Yes | D-SGD with various commonly used topologies... and C-SGD are employed to train image classifiers on CIFAR-10 (Krizhevsky et al., 2009) and Tiny ImageNet (Le & Yang, 2015)... |
| Dataset Splits | No | The paper mentions using the CIFAR-10 and Tiny ImageNet datasets and discusses "validation accuracy", but it does not explicitly state the train/validation/test split percentages or methodology needed for reproducibility. |
| Hardware Specification | Yes | The experiments are conducted on a computing facility with NVIDIA Tesla V100 16GB GPUs and Intel Xeon Gold 6140 CPUs @ 2.30GHz. |
| Software Dependencies | No | The code is based on PyTorch (Paszke et al., 2019). While PyTorch is mentioned, a specific version number (e.g., 1.9, 2.0) is not provided, which is necessary for full reproducibility. |
| Experiment Setup | Yes | The number of workers (one GPU as a worker) is set as 16; and the local batch size is set as 8, 64, and 512 per worker in different cases. For the case of local batch size 64, the initial learning rate is set as 0.1 for ResNet-18 and ResNet-34 and 0.01 for AlexNet... The learning rate is divided by 10 when the model has passed 2/5 and 4/5 of the total number of iterations (He et al., 2016a). We apply the learning rate warm-up (Smith, 2017) and the linear scaling law (He et al., 2016a; Goyal et al., 2017)... Batch normalization (Ioffe & Szegedy, 2015) is employed in training AlexNet. |
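
To make the reported experiment setup concrete, below is a minimal PyTorch sketch (not the authors' code) of a learning-rate schedule consistent with the quoted description: linear scaling with the global batch size, a warm-up phase, and division by 10 at 2/5 and 4/5 of the total iterations. The function name `lr_lambda`, the reference batch size of 256, and the values of `total_iters` and `warmup_iters` are assumptions for illustration; the paper does not specify them here.

```python
# Hypothetical sketch of the reported LR schedule; values below are assumptions.
import torch
from torch.optim.lr_scheduler import LambdaLR

num_workers = 16          # one GPU per worker, as reported
local_batch_size = 64     # per-worker batch size (8, 64, or 512 in the paper)
base_lr = 0.1             # reported initial LR for ResNet-18/34 (0.01 for AlexNet)
total_iters = 50_000      # assumed total iteration count (not stated in the quote)
warmup_iters = 1_000      # assumed warm-up length (not stated in the quote)

# Linear scaling law: scale the base LR with the global batch size
# relative to an assumed reference batch size of 256 (Goyal et al., 2017).
global_batch = num_workers * local_batch_size
scaled_lr = base_lr * global_batch / 256

def lr_lambda(step: int) -> float:
    """Multiplicative LR factor: warm-up, then /10 at 2/5 and again at 4/5 of training."""
    if step < warmup_iters:
        return (step + 1) / warmup_iters   # linear warm-up
    if step < 2 * total_iters // 5:
        return 1.0
    if step < 4 * total_iters // 5:
        return 0.1                         # divided by 10 at 2/5 of iterations
    return 0.01                            # divided by 10 again at 4/5 of iterations

model = torch.nn.Linear(10, 10)            # placeholder model for the sketch
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)
scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)

# Per iteration: optimizer.step() followed by scheduler.step().
```

This is only a reading of the quoted setup; reproducing the paper's results would additionally require the D-SGD communication topology, the exact iteration budget, and the data pipeline from the released code.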