A Unified Analysis of Stochastic Momentum Methods for Deep Learning
Authors: Yan Yan, Tianbao Yang, Zhe Li, Qihang Lin, Yi Yang
IJCAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | These theoretical insights verify the common wisdom and are also corroborated by our empirical analysis on deep learning. Our empirical results for learning deep neural networks complete the unified view and analysis by showing that (i) there is no clear advantage of SHB and SNAG over SG in convergence speed of the training error; (ii) the advantage of SHB and SNAG lies at better generalization due to more stability; (iii) SNAG usually achieves the best tradeoff between speed of convergence in training error and stability of testing error among the three stochastic methods. |
| Researcher Affiliation | Academia | 1 SUSTech-UTS Joint Centre of CIS, Southern University of Science and Technology 2 Centre for Artificial Intelligence, University of Technology Sydney 3 Department of Computer Science, The University of Iowa 4 Tippie College of Business, The University of Iowa |
| Pseudocode | No | The paper describes update rules using mathematical equations (e.g., 'SHB: x_{k+1} = x_k - αG(x_k; ξ_k) + β(x_k - x_{k-1})') but does not include a structured pseudocode or algorithm block (a hedged sketch of these updates appears below the table). |
| Open Source Code | No | The paper does not provide any concrete access information for source code, such as a repository link or an explicit statement about code release in supplementary materials. |
| Open Datasets | Yes | We train a deep convolutional neural network (CNN) for classification on two benchmark datasets, i.e., CIFAR-10 and CIFAR-100. |
| Dataset Splits | Yes | Both datasets contain 50,000 training images of size 32x32 from 10 classes (CIFAR-10) or 100 classes (CIFAR-100) and 10,000 testing images of the same size. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or processor types) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names or solver names with version numbers. |
| Experiment Setup | Yes | We fix the momentum constant β = 0.9 and the regularization parameter of weights to 0.0005. We use a mini-batch of size 128 to compute a stochastic gradient at each iteration. All three methods use the same initialization. We follow the procedure in [Krizhevsky et al., 2012] to set the step size α, i.e., initially giving a relatively large step size and decreasing it by a factor of 10 after a certain number of iterations, when the performance on testing data is observed to saturate. In particular, the best initial step size is 0.001 for SHB and 0.01 for SNAG and SG (a configuration sketch appears below the table). |
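The update rules quoted in the Pseudocode row translate directly into code. The sketch below is a minimal NumPy illustration, not the authors' implementation: the SHB recursion follows the equation quoted above, the SG step is the standard stochastic gradient update, and the SNAG step uses the common look-ahead formulation, which is an assumption about the exact form used in the paper. `stochastic_grad` stands in for the stochastic gradient oracle G(·; ξ).

```python
import numpy as np

def sg_step(x, stochastic_grad, alpha=0.01):
    """Stochastic gradient (SG): x_{k+1} = x_k - alpha * G(x_k; xi_k)."""
    return x - alpha * stochastic_grad(x)

def shb_step(x, x_prev, stochastic_grad, alpha=0.001, beta=0.9):
    """Stochastic heavy ball (SHB), as quoted from the paper:
    x_{k+1} = x_k - alpha * G(x_k; xi_k) + beta * (x_k - x_{k-1})."""
    x_next = x - alpha * stochastic_grad(x) + beta * (x - x_prev)
    return x_next, x  # (new iterate, previous iterate)

def snag_step(x, x_prev, stochastic_grad, alpha=0.01, beta=0.9):
    """Stochastic Nesterov accelerated gradient (SNAG) in its common
    look-ahead form (assumed here, not quoted from the paper):
    y_k = x_k + beta * (x_k - x_{k-1});  x_{k+1} = y_k - alpha * G(y_k; xi_k)."""
    y = x + beta * (x - x_prev)
    x_next = y - alpha * stochastic_grad(y)
    return x_next, x

# Toy usage on f(x) = 0.5 * ||x||^2 with additive gradient noise.
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + 0.01 * rng.standard_normal(x.shape)
x = x_prev = np.ones(5)
for _ in range(200):
    x, x_prev = shb_step(x, x_prev, noisy_grad, alpha=0.1, beta=0.9)
```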
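The Experiment Setup row lists enough hyperparameters to reconstruct an approximate training configuration. The snippet below is a hypothetical re-creation in PyTorch: the paper does not name a framework or the exact CNN architecture, PyTorch's SGD momentum update differs slightly from the textbook heavy-ball recursion, and the schedule milestones are placeholders for the paper's manual "decrease by 10x when test performance saturates" rule.

```python
import torch
from torch import nn, optim

# Placeholder CNN; the paper trains a deeper CNN on CIFAR-10/100 whose
# architecture is not specified in the rows above.
def make_model():
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
    )

# Reported settings: momentum beta = 0.9, weight decay 0.0005, batch size 128,
# initial step size 0.001 for SHB and 0.01 for SNAG and SG.
shb_model, snag_model, sg_model = make_model(), make_model(), make_model()
shb = optim.SGD(shb_model.parameters(), lr=0.001, momentum=0.9, weight_decay=5e-4)
snag = optim.SGD(snag_model.parameters(), lr=0.01, momentum=0.9, nesterov=True,
                 weight_decay=5e-4)
sg = optim.SGD(sg_model.parameters(), lr=0.01, momentum=0.0, weight_decay=5e-4)

# The paper decreases the step size by a factor of 10 when test performance
# saturates; fixed milestones are used here only as a stand-in for that rule.
schedulers = [optim.lr_scheduler.MultiStepLR(opt, milestones=[80, 120], gamma=0.1)
              for opt in (shb, snag, sg)]

batch_size = 128  # mini-batch size used to compute each stochastic gradient
```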