Learning Gradient Descent: Better Generalization and Longer Horizons

Authors: Kaifeng Lv, Shunhua Jiang, Jian Li

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our algorithms on a number of tasks, including deep MLPs, CNNs, and simple LSTMs. We trained two RNN optimizers: one to reproduce DMoptimizer from (Andrychowicz et al., 2016), and the other to implement RNNprop with our new training tricks. Their performance was compared in a number of experiments.
Researcher Affiliation | Academia | Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China.
Pseudocode | No | The paper includes a diagram (Figure 1) describing the model structure and textual descriptions, but no formal pseudocode or algorithm blocks.
Open Source Code | Yes | Our code can be found at https://github.com/vfleaking/rnnprop.
Open Datasets | Yes | We use the same optimizee as in (Andrychowicz et al., 2016) to train these two optimizers: the cross-entropy loss of a simple MLP on the MNIST dataset. The CNN optimizees are the cross-entropy losses of convolutional neural networks (CNNs) with a structure similar to VGGNet (Simonyan & Zisserman, 2015) on the MNIST or CIFAR-10 dataset. (A minimal sketch of the MNIST MLP optimizee appears after this table.)
Dataset Splits | Yes | For DMoptimizer, we select the saved optimizer with the best performance on the validation task, as in (Andrychowicz et al., 2016).
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments.
Software Dependencies | No | The paper mentions TensorFlow but does not specify its version or any other software dependencies with versions.
Experiment Setup | Yes | The value of f(θ) is computed using a minibatch of 128 random pictures. For each iteration during training, the optimizers are allowed to run for 100 steps. The RNN is a two-layer LSTM whose hidden state size is 20. To avoid division by zero, in actual experiments we add another term ϵ = 10^{-8}, and the input is changed to m_t = m̂_t(v̂_t^{1/2} + ϵ)^{-1}, g_t = g_t(v̂_t^{1/2} + ϵ)^{-1}. The parameters β1 and β2 for computing m_t and g_t are simply set to 0.95. In preprocessing, the input is mapped to a 20-dimensional vector for each coordinate. In all our experiments, we simply set a large enough value, α = 0.1.
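
The "Open Datasets" row above quotes the MLP optimizee used to train the optimizers: the cross-entropy loss of a simple MLP on MNIST, evaluated on a minibatch of 128 random pictures. The following is a minimal NumPy sketch of such an optimizee; the single 20-unit sigmoid hidden layer and the helper names mlp_loss and sample_minibatch are illustrative assumptions, not taken from the released code.

```python
import numpy as np

def mlp_loss(params, images, labels):
    """Cross-entropy loss of a small MLP (784 -> 20 -> 10) on one minibatch.

    The layer sizes and sigmoid activation are assumptions for illustration;
    the quoted text only states "a simple MLP on the MNIST dataset".
    """
    W1, b1, W2, b2 = params
    hidden = 1.0 / (1.0 + np.exp(-(images @ W1 + b1)))    # sigmoid hidden layer
    logits = hidden @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)            # numerically stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def sample_minibatch(images, labels, batch_size=128, rng=None):
    """Draw the minibatch of 128 random pictures on which f(θ) is evaluated."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(len(images), size=batch_size, replace=False)
    return images[idx], labels[idx]
```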
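
The "Experiment Setup" row quotes the Adam-style preprocessing that turns raw gradients into the optimizer's per-coordinate inputs, m_t = m̂_t(v̂_t^{1/2} + ϵ)^{-1} and g_t = g_t(v̂_t^{1/2} + ϵ)^{-1}, with β1 = β2 = 0.95 and ϵ = 10^{-8}. Below is a minimal NumPy sketch of that step; reading the hats as bias-corrected moving averages (as in Adam), as well as the class name and streaming structure, are assumptions here rather than details of the authors' TensorFlow implementation.

```python
import numpy as np

class RNNpropInputs:
    """Adam-style per-coordinate preprocessing of gradients into optimizer inputs.

    A sketch of the quoted formulas: exponential moving averages with
    beta1 = beta2 = 0.95 and eps = 1e-8, returning
    m_t = m̂_t / (v̂_t^{1/2} + eps) and g_t = g_t / (v̂_t^{1/2} + eps).
    """

    def __init__(self, dim, beta1=0.95, beta2=0.95, eps=1e-8):
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = np.zeros(dim)   # first-moment moving average
        self.v = np.zeros(dim)   # second-moment moving average
        self.t = 0               # step counter for bias correction

    def step(self, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)   # bias-corrected, as in Adam
        v_hat = self.v / (1 - self.beta2 ** self.t)
        denom = np.sqrt(v_hat) + self.eps
        return m_hat / denom, grad / denom            # inputs fed to the RNN optimizer
```

Usage sketch: create prep = RNNpropInputs(dim=num_params) and call m_in, g_in = prep.step(grad) at each optimization step. In the paper's pipeline, each preprocessed pair is then mapped to a 20-dimensional vector per coordinate and fed to the two-layer LSTM (hidden size 20), whose output is scaled by α = 0.1; those later stages are not reproduced here.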