Gradient-based Hyperparameter Optimization Over Long Horizons

Authors: Paul Micaelli, Amos J. Storkey

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show how FDS mitigates gradient degradation and outperforms competing HPO methods for tasks with a long horizon. In Sections 5.1 and 5.2 we consider small datasets (MNIST and SVHN) and a small network (LeNet) to make reverse-mode differentiation tractable, so that the effect of hyperparameter sharing can be directly measured. In Sections 5.3 and 5.4, we then showcase FDS on CIFAR10 where only forward-mode differentiation is tractable.
Researcher Affiliation | Academia | Paul Micaelli, University of Edinburgh (paul.micaelli@ed.ac.uk); Amos Storkey, University of Edinburgh (a.storkey@ed.ac.uk)
Pseudocode | Yes | Algorithm 1: Simplified FDS algorithm when learning Nα learning rates for the SGD optimizer with momentum. (A hedged sketch of such a forward-mode update appears after this table.)
Open Source Code | Yes | Code is available at: https://github.com/polo5/FDS
Open Datasets | Yes | We consider large hyperparameter search ranges on CIFAR-10 where we significantly outperform greedy gradient-based alternatives... and In Sections 5.1 and 5.2 we consider small datasets (MNIST and SVHN)...
Dataset Splits | No | The paper mentions using a 'validation dataset Dval' and choosing 'hyperparameters such that the final weights minimize some validation loss', but it does not give the split percentages, sample counts, or the methodology used to create these splits. (An illustrative split is sketched after this table.)
Hardware Specification | Yes | All experiments are carried out on a single GTX 2080 GPU.
Software Dependencies | No | The paper mentions using PyTorch and HpBandSter for its implementation, but does not provide version numbers for these or any other software dependencies, which are necessary for full reproducibility.
Experiment Setup | Yes | For the learning rate, we choose W such that the ratio of T/W is similar to the optimal one found in 5.1, and we set W = T for the momentum and weight decay since only a single value is commonly used for these hyperparameters. The schedules learned are shown in Figure 4, which demonstrates that FDS converges in just 10 outer steps to hyperparameters that are very different to online greedy differentiation [14], and correspond to significantly better test accuracy performances. ... α ∈ [−1, 1], β ∈ [−1.5, 1.5], and ξ ∈ [−4·10⁻³, 4·10⁻³], which includes many poor hyperparameter values. (The quoted ranges are illustrated in the final sketch after this table.)
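The Pseudocode row above refers to Algorithm 1, a simplified FDS loop that learns Nα shared learning rates for SGD with momentum via forward-mode differentiation. The following is a minimal sketch of that idea, not the authors' implementation (see the linked repository for that): it uses toy quadratic train/validation losses, a fixed momentum, no weight decay, and illustrative names and constants throughout, and it returns a single hypergradient estimate rather than running the full outer loop.

```python
# Minimal sketch of forward-mode hypergradient accumulation with hyperparameter
# sharing, in the spirit of FDS Algorithm 1. Assumptions not taken from the
# paper: toy quadratic losses, fixed momentum, no weight decay, all constants.
import torch

torch.manual_seed(0)
D = 20                       # number of model weights
T, N = 100, 5                # inner steps and number of shared learning rates
W = T // N                   # each learning rate is shared over W consecutive steps

A_tr = torch.randn(D, D); A_tr = A_tr @ A_tr.T / D   # toy train curvature
A_va = torch.randn(D, D); A_va = A_va @ A_va.T / D   # toy validation curvature
b = torch.randn(D)

def train_loss(w): return 0.5 * w @ A_tr @ w - b @ w
def val_loss(w):   return 0.5 * w @ A_va @ w - b @ w

alphas = torch.full((N,), 0.05)   # hyperparameters optimised by the outer loop
beta = 0.9                        # momentum, kept fixed here for simplicity

def hypergradients(alphas):
    """One inner training run, returning dL_val/dalpha_i for each shared lr."""
    w, v = torch.randn(D), torch.zeros(D)
    Zw = torch.zeros(N, D)        # tangents dw/dalpha_i, one row per shared lr
    Zv = torch.zeros(N, D)        # tangents dv/dalpha_i
    for t in range(T):
        i = t // W                # which shared learning rate is active
        w_req = w.detach().requires_grad_(True)
        g = torch.autograd.grad(train_loss(w_req), w_req, create_graph=True)[0]
        # Hessian-vector products H @ Zw[j] propagate each tangent forward.
        HZ = torch.stack([
            torch.autograd.grad(g, w_req, grad_outputs=Zw[j], retain_graph=True)[0]
            for j in range(N)])
        g = g.detach()
        v = beta * v + g                    # SGD-with-momentum velocity update
        Zv = beta * Zv + HZ                 # tangent of the velocity
        Zw = Zw - alphas[i] * Zv            # indirect dependence through the velocity
        Zw[i] = Zw[i] - v                   # direct dependence of the step on alpha_i
        w = w - alphas[i] * v
    w_req = w.detach().requires_grad_(True)
    g_val = torch.autograd.grad(val_loss(w_req), w_req)[0]
    return Zw @ g_val                       # dL_val/dalpha_i = grad_w L_val . Zw_i

print(hypergradients(alphas))               # hypergradient estimate for one outer step
```

An outer loop would repeatedly call such a routine and update `alphas` with the returned hypergradients; memory stays constant in T because only the tangents, not the training trajectory, are stored, which is the motivation for forward-mode differentiation given in the Research Type row.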
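The Dataset Splits row notes that the paper never specifies how the validation set Dval is constructed. For context, the snippet below shows one common way to hold out a validation split from the CIFAR-10 training set in PyTorch; the 45,000/5,000 sizes and the fixed seed are assumptions for illustration, not the authors' protocol.

```python
# Illustrative CIFAR-10 train/validation split; the 45k/5k sizes are an
# assumption, since the paper does not report its split.
import torch
from torchvision import datasets, transforms

train_full = datasets.CIFAR10(root="./data", train=True, download=True,
                              transform=transforms.ToTensor())
n_val = 5_000                                    # assumed validation size
train_set, val_set = torch.utils.data.random_split(
    train_full, [len(train_full) - n_val, n_val],
    generator=torch.Generator().manual_seed(0))  # fixed seed so the split is reproducible
```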
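The Experiment Setup row quotes the search ranges α ∈ [−1, 1], β ∈ [−1.5, 1.5], and ξ ∈ [−4·10⁻³, 4·10⁻³], with W = T for momentum and weight decay and a smaller W for the learning rate. The sketch below only shows one way such ranges could be turned into learnable, shared schedules; T, W_alpha, the uniform initialisation, and all names are illustrative assumptions, and the paper's own parameterisation and constraint handling are not reproduced.

```python
# Illustrative initialisation of shared hyperparameter schedules within the
# quoted search ranges. T, W_alpha, and the uniform sampling are assumptions.
import torch

T = 200                      # assumed number of inner steps
W_alpha = 20                 # assumed sharing window for the learning rate
N_alpha = T // W_alpha       # number of shared learning rates (W = T for the rest)

ranges = {"alpha": (-1.0, 1.0), "beta": (-1.5, 1.5), "xi": (-4e-3, 4e-3)}

def init_uniform(lo, hi, n):
    """Sample n values uniformly in [lo, hi] and mark them as learnable."""
    return torch.empty(n).uniform_(lo, hi).requires_grad_(True)

hparams = {
    "alpha": init_uniform(*ranges["alpha"], N_alpha),  # one value per block of W_alpha steps
    "beta":  init_uniform(*ranges["beta"], 1),         # single shared momentum
    "xi":    init_uniform(*ranges["xi"], 1),           # single shared weight decay
}
```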