Gradient-based Hyperparameter Optimization Over Long Horizons
Authors: Paul Micaelli, Amos J. Storkey
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show how FDS mitigates gradient degradation and outperforms competing HPO methods for tasks with a long horizon. In Sections 5.1 and 5.2 we consider small datasets (MNIST and SVHN) and a small network (LeNet) to make reverse-mode differentiation tractable, so that the effect of hyperparameter sharing can be directly measured. In Sections 5.3 and 5.4, we then showcase FDS on CIFAR10 where only forward-mode differentiation is tractable. |
| Researcher Affiliation | Academia | Paul Micaelli, University of Edinburgh, paul.micaelli@ed.ac.uk; Amos Storkey, University of Edinburgh, a.storkey@ed.ac.uk |
| Pseudocode | Yes | Algorithm 1: Simplified FDS algorithm when learning N_α learning rates for the SGD optimizer with momentum. (A hedged illustrative sketch of this style of update appears after the table.) |
| Open Source Code | Yes | Code is available at: https://github.com/polo5/FDS |
| Open Datasets | Yes | We consider large hyperparameter search ranges on CIFAR-10 where we significantly outperform greedy gradient-based alternatives... and In Sections 5.1 and 5.2 we consider small datasets (MNIST and SVHN)... |
| Dataset Splits | No | The paper mentions a 'validation dataset D_val' and states that it seeks hyperparameters 'such that the final weights minimize some validation loss', but it does not provide specific split percentages, sample counts, or the methodology used to create these splits. |
| Hardware Specification | Yes | All experiments are carried out on a single GTX 2080 GPU. |
| Software Dependencies | No | The paper mentions using 'PyTorch' and 'HpBandSter' for implementation, but does not provide specific version numbers for these or any other software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | For the learning rate, we choose W such that the ratio of T/W is similar to the optimal one found in 5.1, and we set W = T for the momentum and weight decay since only a single value is commonly used for these hyperparameters. The schedules learned are shown in Figure 4, which demonstrates that FDS converges in just 10 outer steps to hyperparameters that are very different to online greedy differentiation [14], and correspond to significantly better test accuracy performances. ... α ∈ [−1, 1], β ∈ [−1.5, 1.5], and ξ ∈ [−4 · 10⁻³, 4 · 10⁻³], which includes many poor hyperparameter values. (An illustrative configuration sketch of these ranges and of the sharing window W follows the table.) |
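
The Pseudocode row above cites Algorithm 1, which learns N_α shared learning rates by forward-mode differentiation through SGD with momentum. The sketch below is a minimal illustration of that style of forward-mode hypergradient accumulation on a toy quadratic problem with equal-width sharing windows; it is not the authors' implementation, and every name in it (`N_alpha`, `W`, the toy training and validation losses, all constants) is our own assumption.

```python
# Minimal sketch (not the authors' code) of forward-mode hypergradient
# accumulation through SGD with momentum, with N_alpha learning rates shared
# over windows of W steps. The training loss is a toy quadratic so the
# Hessian-vector product is analytic; all constants here are placeholders.
import numpy as np

rng = np.random.default_rng(0)
D = 20                                   # weight dimension
A = rng.standard_normal((D, D))
A = A @ A.T / D                          # PSD Hessian of the toy training loss
b = rng.standard_normal(D)
w_star = rng.standard_normal(D)          # "validation" target for the weights

T = 100                                  # inner horizon (training steps)
N_alpha = 5                              # number of shared learning rates
W = T // N_alpha                         # steps sharing each learning rate
alphas = np.full(N_alpha, 0.05)          # current learning-rate schedule
mu = 0.9                                 # momentum

def train_grad(w):                       # gradient of 0.5 w^T A w - b^T w
    return A @ w - b

def val_grad(w):                         # gradient of 0.5 ||w - w_star||^2
    return w - w_star

w = np.zeros(D)
v = np.zeros(D)
Z = np.zeros((N_alpha, D))               # Z[i] = d w / d alpha_i (forward tangents)
C = np.zeros((N_alpha, D))               # C[i] = d v / d alpha_i

for t in range(T):
    i = t // W                           # index of the active shared learning rate
    g = train_grad(w)
    Hz = Z @ A                           # rows are H @ Z[i] (A is the symmetric Hessian)
    # Tangents mirror the update v <- mu v - alpha_i g(w), w <- w + v:
    C = mu * C - alphas[i] * Hz
    C[i] -= g                            # extra -g term only for the active alpha_i
    Z = Z + C
    v = mu * v - alphas[i] * g
    w = w + v

# Hypergradient of the validation loss at the final weights w.r.t. each alpha_i
hypergrads = Z @ val_grad(w)
print("d L_val / d alpha_i:", hypergrads)
```

The tangents `Z` and `C` are carried forward alongside training, so the hypergradient is obtained with a single product against the validation gradient at the final weights; with a real network, the analytic Hessian-vector product above would be replaced by an automatic-differentiation call.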
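
The Experiment Setup row quotes the hyperparameter-sharing scheme (learning-rate windows of width W, and W = T for momentum and weight decay) together with the CIFAR-10 search ranges. The snippet below is one illustrative way to encode that configuration; the dictionary keys, the function name `sharing_indices`, and the placeholder values of `T` and `n_alpha` are assumptions, not taken from the paper or its code.

```python
# Illustrative encoding (an assumption, not the authors' configuration files)
# of the quoted CIFAR-10 search ranges and of hyperparameter sharing with
# window width W.
search_space = {
    "alpha": (-1.0, 1.0),        # learning-rate range quoted above
    "beta": (-1.5, 1.5),         # momentum range
    "xi": (-4e-3, 4e-3),         # weight-decay range
}

def sharing_indices(T, n_alpha):
    """Map each inner step t to the shared hyperparameter it uses.

    Learning rates are shared over windows of width W = T // n_alpha, while
    momentum and weight decay each use a single shared value (W = T), as the
    Experiment Setup row describes. T and n_alpha are placeholders here.
    """
    W = T // n_alpha
    idx_alpha = [min(t // W, n_alpha - 1) for t in range(T)]
    idx_beta = [0] * T           # W = T: one shared momentum value
    idx_xi = [0] * T             # W = T: one shared weight-decay value
    return idx_alpha, idx_beta, idx_xi

# Example (placeholder values): ten shared learning rates over a 1000-step horizon.
idx_alpha, idx_beta, idx_xi = sharing_indices(T=1000, n_alpha=10)
```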