Online Learning Rate Adaptation with Hypergradient Descent

Authors: Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, Frank Wood

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate the behavior of HD in several tasks, comparing the behavior of the variant algorithms SGD-HD (Algorithm 4), SGDN-HD (Algorithm 5), and Adam-HD (Algorithm 6) to that of their ancestors SGD (Algorithm 1), SGDN (Algorithm 2), and Adam (Algorithm 3), showing, in all cases, a move of the loss trajectory closer to the optimum that would be attained by a tuned initial learning rate. The algorithms are implemented in Torch (Collobert et al., 2011) and PyTorch (Paszke et al., 2017)..."
Researcher Affiliation | Academia | "Atılım Güneş Baydin, University of Oxford, gunes@robots.ox.ac.uk; Robert Cornish, University of Oxford, rcornish@robots.ox.ac.uk; David Martínez Rubio, University of Oxford, david.martinez2@wadham.ac.uk; Mark Schmidt, University of British Columbia, schmidtm@cs.ubc.ca; Frank Wood, University of Oxford, fwood@robots.ox.ac.uk"
Pseudocode | Yes | Algorithm 1: Stochastic gradient descent (SGD); Algorithm 2: SGD with Nesterov (SGDN); Algorithm 3: Adam; Algorithm 4: SGD with hypergradient descent (SGD-HD); Algorithm 5: SGDN with hypergradient descent (SGDN-HD); Algorithm 6: Adam with hypergradient descent (Adam-HD). (A minimal SGD-HD sketch is given after the table.)
Open Source Code | Yes | "Code will be shared here: https://github.com/gbaydin/hypergradient-descent"
Open Datasets | Yes | "...for the task of image classification with the MNIST database."
Dataset Splits | Yes | "We use the full 60,000 images in MNIST for training and compute the validation loss using the 10,000 test images."
Hardware Specification | Yes | "Experiments were run using PyTorch, on a machine with Intel Core i7-6850K CPU, 64 GB RAM, and NVIDIA Titan Xp GPU, where the longest training (200 epochs of the VGG Net on CIFAR-10) lasted approximately two hours for each run."
Software Dependencies | No | "The algorithms are implemented in Torch (Collobert et al., 2011) and PyTorch (Paszke et al., 2017) using an API compatible with the popular torch.optim package..." (Specific version numbers for Torch or PyTorch are not provided in the text.)
Experiment Setup | Yes | "We use a learning rate of α = 0.001 for all algorithms, where for the HD variants this is taken as the initial α0. We take µ = 0.9 for SGDN and SGDN-HD. For Adam, we use β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸, and apply a 1/√t decay to the learning rate (αt = α/√t)... We use a minibatch size of 128 for all the experiments in the paper." (A hedged configuration sketch follows the table.)
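
The HD variants listed in the Pseudocode row share one idea: alongside the usual parameter update, the learning rate itself is updated by a gradient step, using the dot product of the current and previous gradients as the hypergradient. Below is a minimal NumPy sketch of plain SGD-HD in that spirit; the function name sgd_hd, the placeholder value of the hypergradient learning rate beta, and the loop structure are illustrative assumptions, not the authors' implementation (which lives in the linked repository).

```python
import numpy as np

def sgd_hd(grad_f, theta0, alpha0=0.001, beta=1e-4, num_steps=100):
    """Sketch of SGD with hypergradient descent (SGD-HD).

    grad_f(theta) returns a (possibly stochastic) gradient estimate;
    alpha0 mirrors the initial learning rate quoted in the table, while
    beta is an illustrative placeholder for the hypergradient learning rate.
    """
    theta = np.asarray(theta0, dtype=float)
    alpha = alpha0
    prev_grad = np.zeros_like(theta)  # no previous gradient before the first step
    for _ in range(num_steps):
        grad = np.asarray(grad_f(theta), dtype=float)
        # Hypergradient step: adapt alpha by +beta * (current grad . previous grad),
        # increasing alpha while consecutive gradients stay aligned.
        alpha = alpha + beta * float(np.dot(grad, prev_grad))
        theta = theta - alpha * grad  # ordinary SGD step with the adapted alpha
        prev_grad = grad
    return theta, alpha
```

For example, sgd_hd(lambda th: 2.0 * th, theta0=np.ones(3)) runs the sketch on the quadratic f(θ) = ||θ||², where the learning rate grows as long as successive gradients point the same way.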
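For the Experiment Setup row, the reported baseline hyperparameters can be read as a standard PyTorch configuration. The sketch below wires them together for the plain-Adam baseline on MNIST (the HD variants require the custom optimizers from the repository linked above); the toy linear model and the choice to apply the 1/√t decay through a LambdaLR schedule are illustrative assumptions, not details confirmed by the text.

```python
import math
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Toy model standing in for the paper's networks (illustrative only).
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

# Reported baseline settings: alpha = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8.
optimizer = Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)

# 1/sqrt(t) decay of the learning rate (alpha_t = alpha / sqrt(t));
# whether t counts iterations or epochs is an assumption made here.
scheduler = LambdaLR(optimizer, lr_lambda=lambda t: 1.0 / math.sqrt(max(1, t)))

# Full 60,000-image MNIST training split, minibatch size 128, as reported.
train_loader = DataLoader(
    datasets.MNIST("data", train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=128, shuffle=True)
```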