Equilibrated adaptive learning rates for non-convex optimization

Authors: Yann Dauphin, Harm de Vries, Yoshua Bengio

NeurIPS 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce a novel adaptive learning rate scheme, called ESGD, based on the equilibration preconditioner. Our experiments show that ESGD performs as well or better than RMSProp in terms of convergence speed, always clearly improving over plain stochastic gradient descent.
Researcher Affiliation | Academia | Yann N. Dauphin, Université de Montréal, dauphiya@iro.umontreal.ca; Harm de Vries, Université de Montréal, devries@iro.umontreal.ca; Yoshua Bengio, Université de Montréal, yoshua.bengio@umontreal.ca
Pseudocode | Yes | The pseudo-code is given in Algorithm 1.
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We use the standard network architectures described in Martens (2010) for the MNIST and CURVES datasets. These datasets have 784 input dimensions and 60,000 and 20,000 examples respectively.
Dataset Splits | No | The paper mentions using the MNIST and CURVES datasets but does not explicitly provide details on how the training, validation, and test datasets were split (e.g., percentages, sample counts for each split, or specific validation sets).
Hardware Specification | No | The paper states only that 'All experiments were run on GPUs' and does not provide specific hardware details such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions using Theano (Bastien et al., 2012) but does not specify a version number for Theano or other software dependencies, which is required for reproducibility.
Experiment Setup | Yes | We tune the hyper-parameters of the optimization methods with random search. We have sampled the learning rate from a logarithmic scale between [0.1, 0.01] for stochastic gradient descent (SGD) and equilibrated SGD (ESGD). The learning rate for RMSProp and the Jacobi preconditioner is sampled from [0.001, 0.0001]. The damping factor λ used before dividing the gradient is taken from {10^-4, 10^-5, 10^-6}, while the exponential decay rate of RMSProp is taken from {0.9, 0.95}. The networks are initialized using the sparse initialization described in Martens (2010). The minibatch size for all methods is 200.
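
For context on the method summarized in the Research Type row: ESGD preconditions SGD with the equilibration matrix, whose diagonal entries are the row norms of the Hessian; these can be estimated stochastically from Hessian-vector products with Gaussian probe vectors. The following is a compact restatement of the definitions as described in the paper (notation may differ slightly from the original; λ is the damping factor quoted in the Experiment Setup row).

```latex
% Equilibration preconditioner: D^E_ii is the l2 norm of the i-th row of the
% Hessian H, estimated via E_v[(Hv)_i^2] = ||H_{i,.}||_2^2 for v ~ N(0, I).
D^{E}_{ii} \;=\; \lVert H_{i,\cdot} \rVert_2
           \;=\; \sqrt{\mathbb{E}_{v \sim \mathcal{N}(0, I)}\big[(Hv)_i^2\big]}

% ESGD update with learning rate \epsilon and damping factor \lambda:
\theta \;\leftarrow\; \theta \;-\; \epsilon \,\frac{\nabla f(\theta)}{D^{E} + \lambda}
```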
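The Pseudocode row above points to Algorithm 1 of the paper. Below is a minimal NumPy sketch of that loop, not the authors' implementation: `grad_fn` and `hvp_fn` are assumed callables (in the paper the Hessian-vector product is computed with Theano's R-operator, and in practice the curvature estimate is refreshed only every few iterations to amortize its cost).

```python
import numpy as np

def esgd(theta, grad_fn, hvp_fn, lr=0.05, damping=1e-4, num_steps=1000, seed=0):
    """Sketch of equilibrated SGD. `grad_fn(theta)` is assumed to return a
    stochastic gradient and `hvp_fn(theta, v)` a Hessian-vector product Hv."""
    rng = np.random.default_rng(seed)
    D = np.zeros_like(theta)                    # running sum of (Hv)^2
    for k in range(1, num_steps + 1):
        v = rng.standard_normal(theta.shape)    # probe vector v ~ N(0, I)
        Hv = hvp_fn(theta, v)                   # Hessian-vector product
        D += Hv ** 2                            # accumulates estimate of ||H_i,.||_2^2
        precond = np.sqrt(D / k) + damping      # equilibration estimate plus damping
        theta = theta - lr * grad_fn(theta) / precond
    return theta
```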
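The Open Datasets row quotes the MNIST/CURVES setup. For reference, the 784-dimensional MNIST data can be fetched through scikit-learn's OpenML interface as sketched below; the 60,000/10,000 split is the standard one, CURVES has no comparably standard loader, and this snippet is not part of the paper's code.

```python
from sklearn.datasets import fetch_openml

# Fetch the 70,000 x 784 MNIST matrix from OpenML; the first 60,000 rows
# form the standard training set referenced in the paper.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test = X[:60000] / 255.0, X[60000:] / 255.0
print(X_train.shape)  # (60000, 784)
```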
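The Experiment Setup row lists the random-search ranges used for tuning. A small sampler following those ranges might look like the sketch below; the function name and dictionary keys are illustrative and not taken from the paper.

```python
import numpy as np

def sample_hyperparameters(method, rng):
    """Illustrative sampler for the random search quoted above."""
    if method in ("sgd", "esgd"):
        # learning rate drawn on a log scale over [0.01, 0.1]
        lr = 10.0 ** rng.uniform(np.log10(0.01), np.log10(0.1))
    else:  # rmsprop or jacobi-preconditioned SGD
        # learning rate drawn on a log scale over [0.0001, 0.001]
        lr = 10.0 ** rng.uniform(np.log10(0.0001), np.log10(0.001))
    return {
        "learning_rate": lr,
        "damping": rng.choice([1e-4, 1e-5, 1e-6]),   # damping factor lambda
        "rmsprop_decay": rng.choice([0.9, 0.95]),    # exponential decay rate
        "minibatch_size": 200,
    }

# Example usage
rng = np.random.default_rng(0)
print(sample_hyperparameters("esgd", rng))
```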