Equilibrated adaptive learning rates for non-convex optimization
Authors: Yann Dauphin, Harm de Vries, Yoshua Bengio
NeurIPS 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce a novel adaptive learning rate scheme, called ESGD, based on the equilibration preconditioner. Our experiments show that ESGD performs as well or better than RMSProp in terms of convergence speed, always clearly improving over plain stochastic gradient descent. |
| Researcher Affiliation | Academia | Yann N. Dauphin, Université de Montréal, dauphiya@iro.umontreal.ca; Harm de Vries, Université de Montréal, devries@iro.umontreal.ca; Yoshua Bengio, Université de Montréal, yoshua.bengio@umontreal.ca |
| Pseudocode | Yes | The pseudo-code is given in Algorithm 1. (A hedged sketch of the resulting update rule follows the table.) |
| Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We use the standard network architectures described in Martens (2010) for the MNIST and CURVES dataset. Both of these datasets have 784 input dimensions and 60,000 and 20,000 examples respectively. |
| Dataset Splits | No | The paper mentions using the MNIST and CURVES datasets but does not explicitly provide details on how the training, validation, and test datasets were split (e.g., percentages, sample counts for each split, or specific validation sets). |
| Hardware Specification | No | The paper only states 'All experiments were run on GPUs.' and does not provide specific hardware details such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions using 'Theano Bastien et al. (2012)' but does not specify a version number for Theano or other software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | We tune the hyper-parameters of the optimization methods with random search. We have sampled the learning rate from a logarithmic scale between [0.1, 0.01] for stochastic gradient descent (SGD) and equilibrated SGD (ESGD). The learning rate for RMSProp and the Jacobi preconditioner is sampled from [0.001, 0.0001]. The damping factor λ used before dividing the gradient is taken from {10^-4, 10^-5, 10^-6}, while the exponential decay rate of RMSProp is taken from {0.9, 0.95}. The networks are initialized using the sparse initialization described in Martens (2010). The minibatch size for all methods is 200. (A sketch of this sampling scheme appears below the table.) |
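Since the paper provides pseudo-code (Algorithm 1) but no released source, the following is a minimal sketch of what the ESGD update looks like. The toy quadratic objective, the finite-difference Hessian-vector product (the paper uses the R-operator), and all hyper-parameter values here are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of an ESGD-style update with an equilibration preconditioner.
# Assumptions: toy quadratic objective, finite-difference Hessian-vector
# product, and arbitrary hyper-parameter values chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Toy ill-conditioned quadratic: f(theta) = 0.5 * theta^T A theta
A = np.diag([100.0, 1.0, 0.01])

def grad(theta):
    return A @ theta

def hessian_vector_product(theta, v, eps=1e-6):
    # Finite-difference approximation of H v (the paper uses the R-operator).
    return (grad(theta + eps * v) - grad(theta)) / eps

def esgd(theta, n_iters=500, lr=0.01, damping=1e-4, precond_every=1):
    D = np.zeros_like(theta)   # running sum of (H v)^2
    k = 0                      # number of accumulated samples
    for i in range(n_iters):
        if i % precond_every == 0:
            v = rng.standard_normal(theta.shape)     # v ~ N(0, I)
            Hv = hessian_vector_product(theta, v)
            D += Hv ** 2
            k += 1
        # Equilibration preconditioner: element-wise sqrt of E[(Hv)^2],
        # plus a small damping term before dividing the gradient.
        precond = np.sqrt(D / k) + damping
        theta = theta - lr * grad(theta) / precond
    return theta

theta0 = rng.standard_normal(3)
print(esgd(theta0))  # should move toward the minimizer at the origin
```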
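Similarly, the hyper-parameter random search quoted in the last row can be sketched as a log-uniform sampler. The helper names (`log_uniform`, `sample_config`) and the trial loop are assumptions; only the ranges and discrete choices mirror those quoted from the paper.

```python
# Hedged sketch of the hyper-parameter random search described above.
# The function names and number of trials are assumptions; the sampling
# ranges follow the values quoted from the paper.
import numpy as np

rng = np.random.default_rng(0)

def log_uniform(low, high):
    # Sample uniformly on a logarithmic scale between low and high.
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

def sample_config(method):
    cfg = {
        "damping": float(rng.choice([1e-4, 1e-5, 1e-6])),
        "minibatch_size": 200,
    }
    if method in ("sgd", "esgd"):
        cfg["learning_rate"] = log_uniform(0.01, 0.1)
    elif method in ("rmsprop", "jacobi"):
        cfg["learning_rate"] = log_uniform(0.0001, 0.001)
    if method == "rmsprop":
        cfg["decay_rate"] = float(rng.choice([0.9, 0.95]))
    return cfg

# Draw one random-search trial per optimizer for illustration.
for method in ("sgd", "esgd", "rmsprop", "jacobi"):
    print(method, sample_config(method))
```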