Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
On the Generalization of Stochastic Gradient Descent with Momentum
Authors: Ali Ramezani-Kebrya, Kimon Antonakopoulos, Volkan Cevher, Ashish Khisti, Ben Liang
JMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we validate the insights obtained in our theoretical results using experimental evaluation. Our main goal is to study how adding momentum affects the generalization and convergence of SGD. We first investigate the performance of SGDEM when applied to both CIFAR10 (Krizhevsky) and notMNIST datasets for nonconvex loss functions. |
| Researcher Affiliation | Academia | Ali Ramezani-Kebrya, Department of Informatics, University of Oslo, and Visual Intelligence Centre Integreat, Norwegian Centre for Knowledge-driven Machine Learning, Gaustadalléen 23B, Ole-Johan Dahls hus, 0373 Oslo, Norway; Kimon Antonakopoulos, Laboratory for Information and Inference Systems (LIONS), EPFL, Station 11, CH-1015 Lausanne, Switzerland; Volkan Cevher, Laboratory for Information and Inference Systems (LIONS), EPFL, Station 11, CH-1015 Lausanne, Switzerland; Ashish Khisti, Department of Electrical and Computer Engineering, University of Toronto, 40 St. George Street, Toronto, ON M5S 2E4, Canada; Ben Liang, Department of Electrical and Computer Engineering, University of Toronto, 40 St. George Street, Toronto, ON M5S 2E4, Canada |
| Pseudocode | No | The paper describes algorithms like SGDM and SGDEM using mathematical update rules (e.g., "w_{t+1} = w_t + µ(w_t − w_{t−1}) − α_t ∇_w ℓ(w_t; z_{i_t}) (SGDM)") rather than explicit pseudocode blocks or algorithm listings. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code, a link to a code repository, or mention of code in supplementary materials. |
| Open Datasets | Yes | We first investigate the performance of SGDEM when applied to both CIFAR10 (Krizhevsky) and notMNIST datasets for nonconvex loss functions. Fig. 1: Validation loss and generalization error of SGDEM when training ResNet-18 (He et al., 2016) on ImageNet (Deng et al., 2009) in a distributed setting with 4 GPUs under tuned step-size and global minibatch size of 128. |
| Dataset Splits | Yes | Fig. 1: Validation loss and generalization error of SGDEM when training ResNet-18 (He et al., 2016) on ImageNet (Deng et al., 2009) in a distributed setting with 4 GPUs under tuned step-size and global minibatch size of 128. For each td, the momentum is set to µd = 0.9 in the first td epochs and then zero for the next 90 − td epochs. SGDM is a special form of SGDEM with td = 90. The details are provided in Section 5 and Appendix L. Fig. 3: Validation accuracy and generalization gap of SGDEM when training ResNet-18 on ImageNet in a distributed setting with 4 GPUs under tuned step-size and global minibatch size of 128. |
| Hardware Specification | Yes | Details of ImageNet experiments. The global minibatch size and weight decay are set to 128 and 5×10⁻⁵, respectively. For each td, the momentum is set to µd = 0.9 in the first td epochs and then zero for the next 90 − td epochs. We use a cluster with 4 NVIDIA 2080 Ti GPUs with the following CPU details: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz; 48 cores; GPU-to-GPU bandwidth: unidirectional 10GB/s and bidirectional 15GB/s. |
| Software Dependencies | No | The paper mentions various algorithms and models like SGD, SGDM, SGDEM, ResNet-18, ResNet-20, and Adam, but it does not specify any software dependencies (e.g., libraries, frameworks) with version numbers that would be required for replication. |
| Experiment Setup | Yes | We set T to 50000 and 14000 for CIFAR10 and notMNIST experiments, respectively. For each value of µd, we add momentum for 0-10 epochs. For each pair of (µd, td), we repeat the experiments 10 times with random initializations. SGDM can be viewed as a special form of SGDEM when the momentum is added for the entire training (i.e., td = T). For 10 epochs and without data augmentation, we train ResNet-20 on CIFAR10 and a feedforward fully connected neural network with 1000 hidden nodes on notMNIST. ... We set the step-size α = 0.01. The minibatch size is set to 10. We use 10 realizations of SGDEM to evaluate the average performance. |
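The SGDEM scheme quoted in the table (the heavy-ball update w_{t+1} = w_t + µ(w_t − w_{t−1}) − α_t ∇_w ℓ(w_t; z_{i_t}), with momentum µd applied only for the first td steps and then set to zero) can be illustrated with a minimal sketch on a toy quadratic loss. This is not the paper's implementation (no code was released); the function name, toy objective, and constants are illustrative assumptions.

```python
import numpy as np

def sgdem_quadratic_demo(mu_d=0.9, t_d=50, total_steps=100, alpha=0.01):
    """Sketch of SGDEM on the toy loss l(w) = 0.5 * ||w||^2.

    Heavy-ball update (as quoted from the paper):
        w_{t+1} = w_t + mu_t * (w_t - w_{t-1}) - alpha * grad(w_t)
    with mu_t = mu_d for the first t_d steps and 0 afterwards.
    SGDM is the special case t_d = total_steps.
    """
    rng = np.random.default_rng(0)
    w = rng.standard_normal(5)
    w_prev = w.copy()          # w_{-1} = w_0, so the first momentum term vanishes
    for t in range(total_steps):
        mu = mu_d if t < t_d else 0.0   # momentum only in the early phase
        grad = w                        # gradient of 0.5 * ||w||^2 is w itself
        w, w_prev = w + mu * (w - w_prev) - alpha * grad, w
    return w
```

Under this schedule, the early heavy-ball phase accelerates the decrease of the loss, and the plain-SGD tail continues to contract the iterate toward the minimizer at zero.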