Online Learning Rate Adaptation with Hypergradient Descent

Authors: Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, Frank Wood

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate the behavior of HD in several tasks, comparing the behavior of the variant algorithms SGD-HD (Algorithm 4), SGDN-HD (Algorithm 5), and Adam-HD (Algorithm 6) to that of their ancestors SGD (Algorithm 1), SGDN (Algorithm 2), and Adam (Algorithm 3), showing, in all cases, a move of the loss trajectory closer to the optimum that would be attained by a tuned initial learning rate. The algorithms are implemented in Torch (Collobert et al., 2011) and PyTorch (Paszke et al., 2017)..."
Researcher Affiliation | Academia | "Atılım Güneş Baydin, University of Oxford, gunes@robots.ox.ac.uk; Robert Cornish, University of Oxford, rcornish@robots.ox.ac.uk; David Martínez Rubio, University of Oxford, david.martinez2@wadham.ac.uk; Mark Schmidt, University of British Columbia, schmidtm@cs.ubc.ca; Frank Wood, University of Oxford, fwood@robots.ox.ac.uk"
Pseudocode | Yes | Algorithm 1: Stochastic gradient descent (SGD); Algorithm 2: SGD with Nesterov (SGDN); Algorithm 3: Adam; Algorithm 4: SGD with hypergradient descent (SGD-HD); Algorithm 5: SGDN with hypergradient descent (SGDN-HD); Algorithm 6: Adam with hypergradient descent (Adam-HD). (A minimal SGD-HD sketch is given after the table.)
Open Source Code | Yes | "Code will be shared here: https://github.com/gbaydin/hypergradient-descent"
Open Datasets | Yes | "...for the task of image classification with the MNIST database."
Dataset Splits | Yes | "We use the full 60,000 images in MNIST for training and compute the validation loss using the 10,000 test images."
Hardware Specification | Yes | "Experiments were run using PyTorch, on a machine with Intel Core i7-6850K CPU, 64 GB RAM, and NVIDIA Titan Xp GPU, where the longest training (200 epochs of the VGG Net on CIFAR-10) lasted approximately two hours for each run."
Software Dependencies | No | "The algorithms are implemented in Torch (Collobert et al., 2011) and PyTorch (Paszke et al., 2017) using an API compatible with the popular torch.optim package..." (Specific version numbers for Torch or PyTorch are not provided in the text.)
Experiment Setup | Yes | "We use a learning rate of α = 0.001 for all algorithms, where for the HD variants this is taken as the initial α0. We take µ = 0.9 for SGDN and SGDN-HD. For Adam, we use β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸, and apply a 1/√t decay to the learning rate (αt = α/√t)... We use a minibatch size of 128 for all the experiments in the paper." (A hedged configuration sketch follows the table.)
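
The HD variants listed in the Pseudocode row share one idea: alongside the usual parameter update, the learning rate itself is updated by a gradient step, using the dot product of the current and previous gradients as the hypergradient. Below is a minimal NumPy sketch of plain SGD-HD in that spirit; the function name sgd_hd, the placeholder value of the hypergradient learning rate beta, and the loop structure are illustrative assumptions, not the authors' implementation (which lives in the linked repository).

```python
import numpy as np

def sgd_hd(grad_f, theta0, alpha0=0.001, beta=1e-4, num_steps=100):
    """Sketch of SGD with hypergradient descent (SGD-HD).

    grad_f(theta) returns a (possibly stochastic) gradient estimate;
    alpha0 mirrors the initial learning rate quoted in the table, while
    beta is an illustrative placeholder for the hypergradient learning rate.
    """
    theta = np.asarray(theta0, dtype=float)
    alpha = alpha0
    prev_grad = np.zeros_like(theta)  # no previous gradient before the first step
    for _ in range(num_steps):
        grad = np.asarray(grad_f(theta), dtype=float)
        # Hypergradient step: adapt alpha by +beta * (current grad . previous grad),
        # increasing alpha while consecutive gradients stay aligned.
        alpha = alpha + beta * float(np.dot(grad, prev_grad))
        theta = theta - alpha * grad  # ordinary SGD step with the adapted alpha
        prev_grad = grad
    return theta, alpha
```

For example, sgd_hd(lambda th: 2.0 * th, theta0=np.ones(3)) runs the sketch on the quadratic f(θ) = ||θ||², where the learning rate grows as long as successive gradients point the same way.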
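For the Experiment Setup row, the reported baseline hyperparameters can be read as a standard PyTorch configuration. The sketch below wires them together for the plain-Adam baseline on MNIST (the HD variants require the custom optimizers from the repository linked above); the toy linear model and the choice to apply the 1/√t decay through a LambdaLR schedule are illustrative assumptions, not details confirmed by the text.

```python
import math
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Toy model standing in for the paper's networks (illustrative only).
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

# Reported baseline settings: alpha = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8.
optimizer = Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)

# 1/sqrt(t) decay of the learning rate (alpha_t = alpha / sqrt(t));
# whether t counts iterations or epochs is an assumption made here.
scheduler = LambdaLR(optimizer, lr_lambda=lambda t: 1.0 / math.sqrt(max(1, t)))

# Full 60,000-image MNIST training split, minibatch size 128, as reported.
train_loader = DataLoader(
    datasets.MNIST("data", train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=128, shuffle=True)
```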