On the insufficiency of existing momentum schemes for Stochastic Optimization

Authors: Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, Sham M. Kakade

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical results in this paper show that ASGD has performance gains over HB, NAG, and SGD.
Researcher Affiliation | Collaboration | Rahul Kidambi (University of Washington, Seattle), Praneeth Netrapalli (Microsoft Research India), Prateek Jain (Microsoft Research India), and Sham M. Kakade (University of Washington, Seattle)
Pseudocode | Yes | Algorithm 1 (HB: heavy ball with a SFO, i.e. a stochastic first-order oracle); a sketch of this update appears below the table.
Open Source Code | Yes | The code implementing the ASGD algorithm can be found at https://github.com/rahulkidambi/AccSGD
Open Datasets | Yes | Training deep autoencoders for the MNIST dataset.
Dataset Splits | Yes | We use a validation set based decay scheme, wherein, after every 3 epochs, we decay the learning rate by a certain factor (which we grid search on) if the validation zero-one error does not decrease by at least a certain amount (precise numbers are provided in the appendix since they vary across batch sizes). (A sketch of this decay rule appears below the table.)
Hardware Specification | No | The paper mentions using MATLAB and PyTorch for experiments but does not provide specific hardware details such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | We use Matlab to conduct experiments presented in Section 5.1 and use PyTorch (pytorch, 2017) for our deep networks related experiments.
Experiment Setup | Yes | The network architecture follows previous work (Hinton & Salakhutdinov, 2006) and is represented as 784-1000-500-250-30-250-500-1000-784, with the first and last 784 nodes representing the input and output respectively. All hidden/output nodes employ sigmoid activations except for the layer with 30 nodes, which employs linear activations, and we use MSE loss. Initialization follows the scheme of Martens (2010), also employed in Sutskever et al. (2013); Martens & Grosse (2015). We perform training with two minibatch sizes, 1 and 8. ... We use a validation set based decay scheme, wherein, after every 3 epochs, we decay the learning rate by a certain factor... (A PyTorch sketch of this architecture appears below the table.)
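
The Pseudocode row refers to the paper's Algorithm 1, heavy ball (HB) driven by a stochastic first-order oracle (SFO). A minimal sketch of that update is below; the quadratic oracle, step size, and momentum values are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of the heavy ball (HB) update with a stochastic
# first-order oracle (SFO). The oracle and hyperparameters below are
# illustrative assumptions, not the paper's exact experimental settings.
import numpy as np

def sfo(w, rng):
    # Assumed stochastic first-order oracle: a noisy gradient of the
    # simple quadratic f(w) = 0.5 * ||w||^2, used only for illustration.
    return w + 0.1 * rng.standard_normal(w.shape)

def heavy_ball(w0, eta=0.1, beta=0.9, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    w, v = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        g = sfo(w, rng)          # query the stochastic gradient oracle
        v = beta * v - eta * g   # heavy ball (momentum) buffer update
        w = w + v                # parameter update
    return w

print(np.linalg.norm(heavy_ball(np.ones(10))))
```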
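
The validation-based learning rate decay quoted under Dataset Splits can be sketched as follows. The `train_one_epoch` and `validation_error` callables, the decay factor, and the improvement tolerance are placeholders; the paper grid searches the factor and reports the exact numbers in its appendix.

```python
# Sketch of the validation-set-based decay scheme described in the paper:
# every 3 epochs, decay the learning rate by a factor if the validation
# zero-one error has not improved by at least a tolerance.
# decay_factor and min_improvement are placeholder values.
def train_with_validation_decay(train_one_epoch, validation_error,
                                lr=1e-3, epochs=60,
                                decay_factor=0.5, min_improvement=0.01):
    best_err = float("inf")
    for epoch in range(1, epochs + 1):
        train_one_epoch(lr)              # run one epoch at the current lr
        if epoch % 3 == 0:               # check every 3 epochs
            err = validation_error()     # zero-one error on the validation set
            if err > best_err - min_improvement:
                lr *= decay_factor       # insufficient improvement: decay lr
            best_err = min(best_err, err)
    return lr
```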
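
The autoencoder described in the Experiment Setup row can be written in PyTorch roughly as below: the stated 784-1000-500-250-30-250-500-1000-784 topology with sigmoid activations on all hidden/output layers except the linear 30-unit code layer, trained with an MSE loss. This sketch does not reproduce the Martens (2010) initialization or the paper's other training details.

```python
# Sketch of the deep autoencoder architecture described in the paper:
# 784-1000-500-250-30-250-500-1000-784, sigmoid activations everywhere
# except the 30-unit code layer (linear), with an MSE reconstruction loss.
# Initialization and training details from the paper are not reproduced.
import torch
import torch.nn as nn

class DeepAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 1000), nn.Sigmoid(),
            nn.Linear(1000, 500), nn.Sigmoid(),
            nn.Linear(500, 250), nn.Sigmoid(),
            nn.Linear(250, 30),               # linear code layer, no activation
            nn.Linear(30, 250), nn.Sigmoid(),
            nn.Linear(250, 500), nn.Sigmoid(),
            nn.Linear(500, 1000), nn.Sigmoid(),
            nn.Linear(1000, 784), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

model = DeepAutoencoder()
criterion = nn.MSELoss()
x = torch.rand(8, 784)                        # minibatch of size 8, one of the two sizes used
loss = criterion(model(x), x)
print(loss.item())
```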