Equilibrated adaptive learning rates for non-convex optimization

Authors: Yann Dauphin, Harm de Vries, Yoshua Bengio

NeurIPS 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce a novel adaptive learning rate scheme, called ESGD, based on the equilibration preconditioner. Our experiments show that ESGD performs as well or better than RMSProp in terms of convergence speed, always clearly improving over plain stochastic gradient descent.
Researcher Affiliation | Academia | Yann N. Dauphin, Université de Montréal, dauphiya@iro.umontreal.ca; Harm de Vries, Université de Montréal, devries@iro.umontreal.ca; Yoshua Bengio, Université de Montréal, yoshua.bengio@umontreal.ca
Pseudocode | Yes | The pseudo-code is given in Algorithm 1.
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We use the standard network architectures described in Martens (2010) for the MNIST and CURVES datasets. These datasets have 784 input dimensions and 60,000 and 20,000 examples respectively.
Dataset Splits | No | The paper mentions using the MNIST and CURVES datasets but does not explicitly provide details on how the training, validation, and test datasets were split (e.g., percentages, sample counts for each split, or specific validation sets).
Hardware Specification | No | The paper states only that 'All experiments were run on GPUs' and does not provide specific hardware details such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions using Theano (Bastien et al., 2012) but does not specify a version number for Theano or other software dependencies, which is required for reproducibility.
Experiment Setup | Yes | We tune the hyper-parameters of the optimization methods with random search. We have sampled the learning rate from a logarithmic scale between [0.1, 0.01] for stochastic gradient descent (SGD) and equilibrated SGD (ESGD). The learning rate for RMSProp and the Jacobi preconditioner is sampled from [0.001, 0.0001]. The damping factor λ used before dividing the gradient is taken from {10^-4, 10^-5, 10^-6}, while the exponential decay rate of RMSProp is taken from {0.9, 0.95}. The networks are initialized using the sparse initialization described in Martens (2010). The minibatch size for all methods is 200.
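
For context on the method summarized in the Research Type row: ESGD preconditions SGD with the equilibration matrix, whose diagonal entries are the row norms of the Hessian; these can be estimated stochastically from Hessian-vector products with Gaussian probe vectors. The following is a compact restatement of the definitions as described in the paper (notation may differ slightly from the original; λ is the damping factor quoted in the Experiment Setup row).

```latex
% Equilibration preconditioner: D^E_ii is the l2 norm of the i-th row of the
% Hessian H, estimated via E_v[(Hv)_i^2] = ||H_{i,.}||_2^2 for v ~ N(0, I).
D^{E}_{ii} \;=\; \lVert H_{i,\cdot} \rVert_2
           \;=\; \sqrt{\mathbb{E}_{v \sim \mathcal{N}(0, I)}\big[(Hv)_i^2\big]}

% ESGD update with learning rate \epsilon and damping factor \lambda:
\theta \;\leftarrow\; \theta \;-\; \epsilon \,\frac{\nabla f(\theta)}{D^{E} + \lambda}
```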
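The Pseudocode row above points to Algorithm 1 of the paper. Below is a minimal NumPy sketch of that loop, not the authors' implementation: `grad_fn` and `hvp_fn` are assumed callables (in the paper the Hessian-vector product is computed with Theano's R-operator, and in practice the curvature estimate is refreshed only every few iterations to amortize its cost).

```python
import numpy as np

def esgd(theta, grad_fn, hvp_fn, lr=0.05, damping=1e-4, num_steps=1000, seed=0):
    """Sketch of equilibrated SGD. `grad_fn(theta)` is assumed to return a
    stochastic gradient and `hvp_fn(theta, v)` a Hessian-vector product Hv."""
    rng = np.random.default_rng(seed)
    D = np.zeros_like(theta)                    # running sum of (Hv)^2
    for k in range(1, num_steps + 1):
        v = rng.standard_normal(theta.shape)    # probe vector v ~ N(0, I)
        Hv = hvp_fn(theta, v)                   # Hessian-vector product
        D += Hv ** 2                            # accumulates estimate of ||H_i,.||_2^2
        precond = np.sqrt(D / k) + damping      # equilibration estimate plus damping
        theta = theta - lr * grad_fn(theta) / precond
    return theta
```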
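The Open Datasets row quotes the MNIST/CURVES setup. For reference, the 784-dimensional MNIST data can be fetched through scikit-learn's OpenML interface as sketched below; the 60,000/10,000 split is the standard one, CURVES has no comparably standard loader, and this snippet is not part of the paper's code.

```python
from sklearn.datasets import fetch_openml

# Fetch the 70,000 x 784 MNIST matrix from OpenML; the first 60,000 rows
# form the standard training set referenced in the paper.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test = X[:60000] / 255.0, X[60000:] / 255.0
print(X_train.shape)  # (60000, 784)
```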
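The Experiment Setup row lists the random-search ranges used for tuning. A small sampler following those ranges might look like the sketch below; the function name and dictionary keys are illustrative and not taken from the paper.

```python
import numpy as np

def sample_hyperparameters(method, rng):
    """Illustrative sampler for the random search quoted above."""
    if method in ("sgd", "esgd"):
        # learning rate drawn on a log scale over [0.01, 0.1]
        lr = 10.0 ** rng.uniform(np.log10(0.01), np.log10(0.1))
    else:  # rmsprop or jacobi-preconditioned SGD
        # learning rate drawn on a log scale over [0.0001, 0.001]
        lr = 10.0 ** rng.uniform(np.log10(0.0001), np.log10(0.001))
    return {
        "learning_rate": lr,
        "damping": rng.choice([1e-4, 1e-5, 1e-6]),   # damping factor lambda
        "rmsprop_decay": rng.choice([0.9, 0.95]),    # exponential decay rate
        "minibatch_size": 200,
    }

# Example usage
rng = np.random.default_rng(0)
print(sample_hyperparameters("esgd", rng))
```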