Scalable One-Pass Optimisation of High-Dimensional Weight-Update Hyperparameters by Implicit Differentiation

Authors: Ross M Clarke, Elre Talea Oldewage, José Miguel Hernández-Lobato

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our method performs competitively from varied random hyperparameter initialisations on several UCI datasets and Fashion-MNIST (using a one-layer MLP), Penn Treebank (using an LSTM) and CIFAR-10 (using a ResNet-18), in time only 2-3x greater than vanilla training." From Section 4 (Experiments): "Our empirical evaluation uses the hardware and software detailed in Appendix A.1."
Researcher Affiliation | Academia | Ross M. Clarke (University of Cambridge, rmc78@cam.ac.uk); Elre T. Oldewage (University of Cambridge, etv21@cam.ac.uk); José Miguel Hernández-Lobato (University of Cambridge and Alan Turing Institute, jmh233@cam.ac.uk)
Pseudocode | Yes | Algorithm 1: "Scalable One-Pass Optimisation of High-Dimensional Weight-Update Hyperparameters by Implicit Differentiation" (Figure 4, left)
Open Source Code | Yes | "Our empirical evaluation uses the hardware and software detailed in Appendix A.1, with code available at https://github.com/rmclarke/OptimisingWeightUpdateHyperparameters." "Source code for all our experiments is provided to reviewers, and is made available on GitHub (https://github.com/rmclarke/OptimisingWeightUpdateHyperparameters)."
Open Datasets | Yes | "Our datasets are all standard in the ML literature. For completeness, we outline the licences under which they are used in Table 4." Table 4 lists: Fashion-MNIST (MIT licence; PyTorch via torchvision), Penn Treebank (proprietary; fair-use subset, widely used; http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz), CIFAR-10 (no licence specified; PyTorch via torchvision)
Dataset Splits | Yes | "We use the UCI/Kin8nm dataset split sizes of Gal & Ghahramani (2016) and standard 60%/20%/20% splits for training/validation/test datasets elsewhere." (See the data-loading and splitting sketch after this table.)
Hardware Specification | Yes | Table 3 ("System configurations used to run our experiments"): Consumer Desktop: Intel Core i7-3930K CPU, NVIDIA RTX 2080 GPU; Local Cluster: Intel Core i9-10900X CPU, NVIDIA RTX 2080 GPU; Cambridge Service for Data Driven Discovery (CSD3): AMD EPYC 7763 CPU, NVIDIA Ampere A100 GPU
Software Dependencies | Yes | Table 3 ("System configurations used to run our experiments"): Consumer Desktop: Python 3.9.7, PyTorch 1.8.1, CUDA 10.1; Local Cluster: Python 3.7.12, PyTorch 1.8.1, CUDA 10.1; Cambridge Service for Data Driven Discovery (CSD3): Python 3.7.7, PyTorch 1.10.1, CUDA 11.1
Experiment Setup | Yes | "Throughout, we train models using SGD with weight decay and momentum. We uniformly sample initial learning rates, weight decays and momenta, using logarithmic and sigmoidal transforms (see Appendix B.3), applying each initialisation in the following eight settings: updating hyperparameters every T = 10 batches with look-back distance i = 5 steps (except Baydin, which has no such meta-hyperparameters, so updates hyperparameters at every batch (Baydin et al., 2018)). Our approximate hypergradient is passed to Adam (Kingma & Ba, 2015) with meta-learning rate κ = 0.05 and default β1 = 0.9, β2 = 0.999." (See the meta-update loop sketch after this table.)
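
To make the quoted dataset and split details concrete, the following is a minimal sketch, assuming PyTorch and torchvision are available. The root path, seed and the use of random_split are illustrative assumptions, not details taken from the paper; Penn Treebank is distributed as the plain-text archive at the URL quoted above and is not loaded here.

```python
# Minimal sketch (not the authors' code): load the torchvision datasets named in
# Table 4 and form a 60%/20%/20% train/validation/test split.
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# Fashion-MNIST and CIFAR-10 are available directly through torchvision.
fashion_mnist = datasets.FashionMNIST(root="data", train=True, download=True,
                                      transform=to_tensor)
cifar10 = datasets.CIFAR10(root="data", train=True, download=True,
                           transform=to_tensor)

# Penn Treebank ("simple-examples" archive) is fetched separately from:
PTB_URL = "http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz"

def split_60_20_20(dataset, seed=0):
    """Split a dataset into 60%/20%/20% train/validation/test subsets."""
    n = len(dataset)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    n_test = n - n_train - n_val  # remainder absorbs rounding
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_val, n_test], generator=generator)

train_set, val_set, test_set = split_60_20_20(cifar10)
```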
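The quoted experiment setup can also be read as an outer loop around vanilla SGD training. Below is a minimal, hypothetical sketch of that loop, assuming PyTorch: hyperparameters are re-parameterised through logarithmic and sigmoidal transforms, and updated every T = 10 batches by Adam with meta-learning rate κ = 0.05. The names model, train_loader, sgd_step and approximate_hypergradient are assumptions for illustration; in particular, approximate_hypergradient stands in for the paper's implicit-differentiation hypergradient over the last i = 5 weight updates and is not implemented here.

```python
# Hypothetical sketch (not the authors' implementation) of the quoted meta-update
# schedule: SGD inner steps, with hyperparameters adjusted every T batches by Adam
# acting on an approximate hypergradient computed over the last i weight updates.
import torch

T = 10            # hyperparameter update interval (batches)
i_lookback = 5    # look-back distance: number of recent weight updates differentiated through
kappa = 0.05      # Adam meta-learning rate

# Unconstrained meta-parameters: [log learning rate, log weight decay, logit(momentum)].
# In the paper, initial values are sampled uniformly in this transformed space;
# zeros are used here purely as a placeholder.
meta_params = torch.zeros(3, requires_grad=True)
meta_optimizer = torch.optim.Adam([meta_params], lr=kappa, betas=(0.9, 0.999))

def current_hyperparameters(theta):
    """Map unconstrained meta-parameters to (learning rate, weight decay, momentum)."""
    return (torch.exp(theta[0]).item(),      # logarithmic transform
            torch.exp(theta[1]).item(),      # logarithmic transform
            torch.sigmoid(theta[2]).item())  # sigmoidal transform

for step, batch in enumerate(train_loader):   # train_loader: assumed, defined elsewhere
    lr, wd, momentum = current_hyperparameters(meta_params)
    # Inner update: one step of SGD with weight decay and momentum (helper assumed).
    sgd_step(model, batch, lr=lr, weight_decay=wd, momentum=momentum)

    if (step + 1) % T == 0:
        # Placeholder for the paper's implicit-differentiation estimate of the
        # gradient of the objective w.r.t. the meta-parameters, computed over the
        # last i_lookback weight updates.
        hypergrad = approximate_hypergradient(model, meta_params, i_lookback)
        meta_optimizer.zero_grad()
        meta_params.grad = hypergrad
        meta_optimizer.step()
```

The sketch mirrors the quoted schedule only; the hypergradient computation itself is the paper's contribution and is deliberately left abstract.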