Decentralized SGD and Average-direction SAM are Asymptotically Equivalent

Authors: Tongtian Zhu, Fengxiang He, Kaixuan Chen, Mingli Song, Dacheng Tao

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments support our theory and the code is available at D-SGD and SAM. Our empirical results also fully support our theory (see Figure 1 and Figure 3).
Researcher Affiliation | Collaboration | 1 College of Computer Science and Technology, Zhejiang University; 2 JD Explore Academy, JD.com, Inc.; 3 Artificial Intelligence and its Applications Institute, School of Informatics, University of Edinburgh; 4 The University of Sydney.
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled "Pseudocode" or "Algorithm", nor does it include structured code-like blocks outlining a procedure.
Open Source Code | Yes | "Experiments support our theory and the code is available at D-SGD and SAM." (Abstract) and "Code is available at D-SGD and SAM." (Section 5)
Open Datasets | Yes | "D-SGD with various commonly used topologies... and C-SGD are employed to train image classifiers on CIFAR-10 (Krizhevsky et al., 2009) and Tiny ImageNet (Le & Yang, 2015)..."
Dataset Splits | No | The paper mentions using CIFAR-10 and Tiny ImageNet and discusses "validation accuracy", but it does not explicitly state the train/validation/test split percentages or the splitting methodology needed for reproducibility. (A hypothetical split specification is sketched after the table.)
Hardware Specification | Yes | "The experiments are conducted on a computing facility with NVIDIA Tesla V100 16GB GPUs and Intel Xeon Gold 6140 CPU @ 2.30GHz CPUs."
Software Dependencies | No | "The code is based on PyTorch (Paszke et al., 2019)." While PyTorch is mentioned, a specific version number (e.g., 1.9, 2.0) is not provided, which is necessary for full reproducibility.
Experiment Setup | Yes | "The number of workers (one GPU as a worker) is set as 16; and the local batch size is set as 8, 64, and 512 per worker in different cases. For the case of local batch size 64, the initial learning rate is set as 0.1 for ResNet-18 and ResNet-34 and 0.01 for AlexNet... The learning rate is divided by 10 when the model has passed 2/5 and 4/5 of the total number of iterations (He et al., 2016a). We apply the learning rate warm-up (Smith, 2017) and the linear scaling law (He et al., 2016a; Goyal et al., 2017)... Batch normalization (Ioffe & Szegedy, 2015) is employed in training AlexNet." (A sketch of this learning-rate schedule, under stated assumptions, follows the table.)
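
Because the missing piece flagged in the "Dataset Splits" row is an explicit split, the following is a minimal, hypothetical sketch of what a seeded CIFAR-10 train/validation split could look like in PyTorch/torchvision. The 45,000/5,000 proportions, the seed, and the normalization constants are assumptions chosen for illustration, not values reported by the authors.

import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Standard CIFAR-10 preprocessing; the normalization statistics below are
# commonly used values, assumed here rather than taken from the paper.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])

full_train = datasets.CIFAR10(root="./data", train=True,
                              download=True, transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False,
                            download=True, transform=transform)

# Fixed generator so the assumed 45k/5k train/validation split is identical
# across runs (and across workers in a decentralized setting).
generator = torch.Generator().manual_seed(42)
train_set, val_set = random_split(full_train, [45_000, 5_000],
                                  generator=generator)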
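
To make the quoted "Experiment Setup" row concrete, here is a minimal single-worker sketch of a PyTorch learning-rate schedule that combines the linear scaling law, warm-up, and step decay at 2/5 and 4/5 of training. The total iteration count, warm-up length, momentum value, and the 64-sample reference batch size are assumptions for illustration; the paper's quoted text only names the techniques and the base learning rates.

import torch

local_batch_size = 64             # one of the quoted cases (8, 64, 512 per worker)
reference_batch_size = 64         # assumed reference point for linear scaling
base_lr = 0.1                     # quoted value for ResNet-18/34 at batch size 64
total_iters = 10_000              # assumed; not stated in the quoted excerpt
warmup_iters = total_iters // 20  # assumed warm-up length

# Linear scaling law (Goyal et al., 2017): scale the learning rate in
# proportion to the batch size relative to the reference case.
scaled_lr = base_lr * local_batch_size / reference_batch_size

model = torch.nn.Linear(10, 10)   # placeholder model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr,
                            momentum=0.9)  # momentum value is an assumption

def lr_factor(it: int) -> float:
    # Linear warm-up, then divide the learning rate by 10 once 2/5 and
    # again once 4/5 of the total iterations have passed, as quoted above.
    if it < warmup_iters:
        return (it + 1) / warmup_iters
    if it < 2 * total_iters // 5:
        return 1.0
    if it < 4 * total_iters // 5:
        return 0.1
    return 0.01

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
# In a training loop, call scheduler.step() after each optimizer.step().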