A Unified Analysis of Stochastic Momentum Methods for Deep Learning

Authors: Yan Yan, Tianbao Yang, Zhe Li, Qihang Lin, Yi Yang

IJCAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "These theoretical insights verify the common wisdom and are also corroborated by our empirical analysis on deep learning. Our empirical results for learning deep neural networks complete the unified view and analysis by showing that (i) there is no clear advantage of SHB and SNAG over SG in convergence speed of the training error; (ii) the advantage of SHB and SNAG lies at better generalization due to more stability; (iii) SNAG usually achieves the best tradeoff between speed of convergence in training error and stability of testing error among the three stochastic methods."
Researcher Affiliation | Academia | (1) SUSTech-UTS Joint Centre of CIS, Southern University of Science and Technology; (2) Centre for Artificial Intelligence, University of Technology Sydney; (3) Department of Computer Science, The University of Iowa; (4) Tippie College of Business, The University of Iowa
Pseudocode | No | The paper describes the update rules using mathematical equations (e.g., 'SHB: x_{k+1} = x_k - αG(x_k; ξ_k) + β(x_k - x_{k-1})') but does not include a structured pseudocode or algorithm block (see the update-rule sketch after the table).
Open Source Code | No | The paper does not provide any concrete access information for source code, such as a repository link or an explicit statement about code release in supplementary materials.
Open Datasets | Yes | "We train a deep convolutional neural network (CNN) for classification on two benchmark datasets, i.e., CIFAR-10 and CIFAR-100."
Dataset Splits | Yes | "Both datasets contain 50,000 training images of size 32 x 32 from 10 classes (CIFAR-10) or 100 classes (CIFAR-100) and 10,000 testing images of the same size."
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or processor types) used for running the experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names or solver names with version numbers.
Experiment Setup | Yes | "We fix the momentum constant β = 0.9 and the regularization parameter of weights to 0.0005. We use a mini-batch of size 128 to compute a stochastic gradient at each iteration. All three methods use the same initialization. We follow the procedure in [Krizhevsky et al., 2012] to set the step size α, i.e., initially giving a relatively large step size and decreasing the step size by 10 times after a certain number of iterations, when the performance on testing data is observed to saturate. In particular, for SHB the best initial step size is 0.001 and that for SNAG and SG is 0.01." (See the optimizer configuration sketch after the table.)
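
The Pseudocode row notes that the paper states its methods only as equations. For reference, here is a minimal NumPy sketch of the three update rules; only the SHB rule is quoted in the table above, so the SG and SNAG forms below are standard textbook formulations and may differ from the paper's exact notation.

```python
import numpy as np

# Minimal sketch of the three update rules compared in the paper, written
# against a generic stochastic gradient oracle grad(x). Only the SHB rule is
# quoted above; the SG and SNAG forms are standard formulations.

def sg_step(x, grad, alpha):
    """SG: x_{k+1} = x_k - alpha * G(x_k; xi_k)."""
    return x - alpha * grad(x)

def shb_step(x, x_prev, grad, alpha, beta):
    """SHB (stochastic heavy-ball), as quoted in the table:
    x_{k+1} = x_k - alpha * G(x_k; xi_k) + beta * (x_k - x_{k-1})."""
    return x - alpha * grad(x) + beta * (x - x_prev)

def snag_step(x, x_prev, grad, alpha, beta):
    """SNAG (stochastic Nesterov accelerated gradient), one common form:
    the gradient is evaluated at the extrapolated point y_k."""
    y = x + beta * (x - x_prev)
    return y - alpha * grad(y)

# Toy usage on f(x) = 0.5 * ||x||^2 with a noisy gradient oracle.
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + 0.01 * rng.standard_normal(x.shape)

x_prev = x = np.ones(5)
for _ in range(200):
    x, x_prev = shb_step(x, x_prev, noisy_grad, alpha=0.1, beta=0.9), x
print(x)  # iterates hover near the minimizer at the origin
```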
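
The Experiment Setup row reports the hyperparameters but, as the Software Dependencies row notes, not the software stack. The sketch below shows how those reported settings could be wired into a standard SGD optimizer; it assumes PyTorch (which the paper does not name), and the stand-in model, milestone values, and helper name make_optimizer are illustrative, not taken from the paper.

```python
import torch.nn as nn
import torch.optim as optim

# Reported settings: momentum beta = 0.9, weight decay 0.0005, mini-batch
# size 128, initial step size 0.001 for SHB and 0.01 for SNAG/SG, decreased
# by 10x once testing performance saturates. Everything else here (PyTorch,
# the stand-in model, the milestone values) is an assumption for illustration.

def make_optimizer(method, params):
    """Map each method to torch.optim.SGD with the reported hyperparameters.
    Note that PyTorch's momentum update only approximates the classical
    SHB/SNAG recursions analyzed in the paper."""
    if method == "SG":
        return optim.SGD(params, lr=0.01, momentum=0.0, weight_decay=0.0005)
    if method == "SHB":
        return optim.SGD(params, lr=0.001, momentum=0.9, weight_decay=0.0005)
    if method == "SNAG":
        return optim.SGD(params, lr=0.01, momentum=0.9, nesterov=True,
                         weight_decay=0.0005)
    raise ValueError(f"unknown method: {method}")

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # placeholder for the paper's CNN
optimizer = make_optimizer("SNAG", model.parameters())

# The paper drops the step size by 10x when testing performance saturates;
# the fixed milestones below are placeholders for that observation-driven
# manual schedule (mini-batches of 128 would be drawn by the data loader).
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)
```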