Gradient-based Hyperparameter Optimization Over Long Horizons
Authors: Paul Micaelli, Amos J. Storkey
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show how FDS mitigates gradient degradation and outperforms competing HPO methods for tasks with a long horizon. In Sections 5.1 and 5.2 we consider small datasets (MNIST and SVHN) and a small network (LeNet) to make reverse-mode differentiation tractable, so that the effect of hyperparameter sharing can be directly measured. In Sections 5.3 and 5.4, we then showcase FDS on CIFAR10 where only forward-mode differentiation is tractable. |
| Researcher Affiliation | Academia | Paul Micaelli, University of Edinburgh, paul.micaelli@ed.ac.uk; Amos Storkey, University of Edinburgh, a.storkey@ed.ac.uk |
| Pseudocode | Yes | Algorithm 1: Simplified FDS algorithm when learning N_α learning rates for the SGD optimizer with momentum. (A hedged illustrative sketch of this style of update appears after the table.) |
| Open Source Code | Yes | Code is available at: https://github.com/polo5/FDS |
| Open Datasets | Yes | We consider large hyperparameter search ranges on CIFAR-10 where we significantly outperform greedy gradient-based alternatives... and In Sections 5.1 and 5.2 we consider small datasets (MNIST and SVHN)... |
| Dataset Splits | No | The paper mentions a 'validation dataset D_val' and states that it seeks hyperparameters 'such that the final weights minimize some validation loss', but it does not provide specific split percentages, sample counts, or the methodology used to create these splits. |
| Hardware Specification | Yes | All experiments are carried out on a single GTX 2080 GPU. |
| Software Dependencies | No | The paper mentions using 'PyTorch' and 'HpBandSter' for implementation, but does not provide specific version numbers for these or any other software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | For the learning rate, we choose W such that the ratio of T/W is similar to the optimal one found in 5.1, and we set W = T for the momentum and weight decay since only a single value is commonly used for these hyperparameters. The schedules learned are shown in Figure 4, which demonstrates that FDS converges in just 10 outer steps to hyperparameters that are very different to online greedy differentiation [14], and correspond to significantly better test accuracy performances. ... α ∈ [−1, 1], β ∈ [−1.5, 1.5], and ξ ∈ [−4 · 10⁻³, 4 · 10⁻³], which includes many poor hyperparameter values. (An illustrative configuration sketch of these ranges and of the sharing window W follows the table.) |
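
The Pseudocode row above cites Algorithm 1, which learns N_α shared learning rates by forward-mode differentiation through SGD with momentum. The sketch below is a minimal illustration of that style of forward-mode hypergradient accumulation on a toy quadratic problem with equal-width sharing windows; it is not the authors' implementation, and every name in it (`N_alpha`, `W`, the toy training and validation losses, all constants) is our own assumption.

```python
# Minimal sketch (not the authors' code) of forward-mode hypergradient
# accumulation through SGD with momentum, with N_alpha learning rates shared
# over windows of W steps. The training loss is a toy quadratic so the
# Hessian-vector product is analytic; all constants here are placeholders.
import numpy as np

rng = np.random.default_rng(0)
D = 20                                   # weight dimension
A = rng.standard_normal((D, D))
A = A @ A.T / D                          # PSD Hessian of the toy training loss
b = rng.standard_normal(D)
w_star = rng.standard_normal(D)          # "validation" target for the weights

T = 100                                  # inner horizon (training steps)
N_alpha = 5                              # number of shared learning rates
W = T // N_alpha                         # steps sharing each learning rate
alphas = np.full(N_alpha, 0.05)          # current learning-rate schedule
mu = 0.9                                 # momentum

def train_grad(w):                       # gradient of 0.5 w^T A w - b^T w
    return A @ w - b

def val_grad(w):                         # gradient of 0.5 ||w - w_star||^2
    return w - w_star

w = np.zeros(D)
v = np.zeros(D)
Z = np.zeros((N_alpha, D))               # Z[i] = d w / d alpha_i (forward tangents)
C = np.zeros((N_alpha, D))               # C[i] = d v / d alpha_i

for t in range(T):
    i = t // W                           # index of the active shared learning rate
    g = train_grad(w)
    Hz = Z @ A                           # rows are H @ Z[i] (A is the symmetric Hessian)
    # Tangents mirror the update v <- mu v - alpha_i g(w), w <- w + v:
    C = mu * C - alphas[i] * Hz
    C[i] -= g                            # extra -g term only for the active alpha_i
    Z = Z + C
    v = mu * v - alphas[i] * g
    w = w + v

# Hypergradient of the validation loss at the final weights w.r.t. each alpha_i
hypergrads = Z @ val_grad(w)
print("d L_val / d alpha_i:", hypergrads)
```

The tangents `Z` and `C` are carried forward alongside training, so the hypergradient is obtained with a single product against the validation gradient at the final weights; with a real network, the analytic Hessian-vector product above would be replaced by an automatic-differentiation call.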
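
The Experiment Setup row quotes the hyperparameter-sharing scheme (learning-rate windows of width W, and W = T for momentum and weight decay) together with the CIFAR-10 search ranges. The snippet below is one illustrative way to encode that configuration; the dictionary keys, the function name `sharing_indices`, and the placeholder values of `T` and `n_alpha` are assumptions, not taken from the paper or its code.

```python
# Illustrative encoding (an assumption, not the authors' configuration files)
# of the quoted CIFAR-10 search ranges and of hyperparameter sharing with
# window width W.
search_space = {
    "alpha": (-1.0, 1.0),        # learning-rate range quoted above
    "beta": (-1.5, 1.5),         # momentum range
    "xi": (-4e-3, 4e-3),         # weight-decay range
}

def sharing_indices(T, n_alpha):
    """Map each inner step t to the shared hyperparameter it uses.

    Learning rates are shared over windows of width W = T // n_alpha, while
    momentum and weight decay each use a single shared value (W = T), as the
    Experiment Setup row describes. T and n_alpha are placeholders here.
    """
    W = T // n_alpha
    idx_alpha = [min(t // W, n_alpha - 1) for t in range(T)]
    idx_beta = [0] * T           # W = T: one shared momentum value
    idx_xi = [0] * T             # W = T: one shared weight-decay value
    return idx_alpha, idx_beta, idx_xi

# Example (placeholder values): ten shared learning rates over a 1000-step horizon.
idx_alpha, idx_beta, idx_xi = sharing_indices(T=1000, n_alpha=10)
```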