Scalable One-Pass Optimisation of High-Dimensional Weight-Update Hyperparameters by Implicit Differentiation
Authors: Ross M Clarke, Elre Talea Oldewage, José Miguel Hernández-Lobato
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method performs competitively from varied random hyperparameter initialisations on several UCI datasets and Fashion-MNIST (using a one-layer MLP), Penn Treebank (using an LSTM) and CIFAR-10 (using a ResNet-18), in time only 2-3× greater than vanilla training. [...] Our empirical evaluation uses the hardware and software detailed in Appendix A.1. |
| Researcher Affiliation | Academia | Ross M. Clarke University of Cambridge rmc78@cam.ac.uk Elre T. Oldewage University of Cambridge etv21@cam.ac.uk José Miguel Hernández-Lobato University of Cambridge Alan Turing Institute jmh233@cam.ac.uk |
| Pseudocode | Yes | Algorithm 1 Scalable One-Pass Optimisation of High-Dimensional Weight-Update Hyperparameters by Implicit Differentiation (Figure 4, left) |
| Open Source Code | Yes | Our empirical evaluation uses the hardware and software detailed in Appendix A.1, with code available at https://github.com/rmclarke/OptimisingWeightUpdateHyperparameters. Source code for all our experiments is provided to reviewers, and is made available on GitHub (https://github.com/rmclarke/OptimisingWeightUpdateHyperparameters). |
| Open Datasets | Yes | Our datasets are all standard in the ML literature. For completeness, we outline the licences under which they are used in Table 4. Table 4 lists: Fashion-MNIST (MIT licence; PyTorch via torchvision), Penn Treebank (proprietary; fair-use subset, widely used; http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz), CIFAR-10 (no licence specified; PyTorch via torchvision). |
| Dataset Splits | Yes | We use the UCI/Kin8nm dataset split sizes of Gal & Ghahramani (2016) and standard 60%/20%/20% splits for training/validation/test datasets elsewhere |
| Hardware Specification | Yes | Table 3 (system configurations used to run our experiments): Consumer Desktop: Intel Core i7-3930K CPU, NVIDIA RTX 2080GTX GPU; Local Cluster: Intel Core i9-10900X CPU, NVIDIA RTX 2080GTX GPU; Cambridge Service for Data Driven Discovery (CSD3)*: AMD EPYC 7763 CPU, NVIDIA Ampere A100 GPU. |
| Software Dependencies | Yes | Table 3 (system configurations used to run our experiments): Consumer Desktop: Python 3.9.7, PyTorch 1.8.1, CUDA 10.1; Local Cluster: Python 3.7.12, PyTorch 1.8.1, CUDA 10.1; Cambridge Service for Data Driven Discovery (CSD3)*: Python 3.7.7, PyTorch 1.10.1, CUDA 11.1. |
| Experiment Setup | Yes | Throughout, we train models using SGD with weight decay and momentum. We uniformly sample initial learning rates, weight decays and momenta, using logarithmic and sigmoidal transforms (see Appendix B.3), and apply each initialisation in eight settings. Hyperparameters are updated every T = 10 batches with look-back distance i = 5 steps (except Baydin, which has no such meta-hyperparameters and so updates hyperparameters at every batch (Baydin et al., 2018)). Our approximate hypergradient is passed to Adam (Kingma & Ba, 2015) with meta-learning rate κ = 0.05 and default β1 = 0.9, β2 = 0.999. A minimal sketch of this loop follows the table. |
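
The experiment setup quoted above amounts to an ordinary training loop with a periodic hyperparameter update. The sketch below is a minimal illustration, not the authors' implementation: the model, the synthetic data, and `approximate_hypergradient` are hypothetical placeholders (the paper's implicit-differentiation estimate is not reproduced here); only the schedule (T = 10, look-back i = 5), the log/sigmoid hyperparameter transforms, and the Adam meta-optimiser settings (κ = 0.05, β1 = 0.9, β2 = 0.999) come from the quoted setup.

```python
# Minimal sketch (assumptions noted above), not the authors' code.
import torch

T, LOOKBACK = 10, 5  # hyperparameter update period and look-back distance from the setup

# Hyperparameters held in unconstrained space (log for learning rate and weight decay,
# logit for momentum), mirroring the logarithmic/sigmoidal transforms mentioned above.
log_lr = torch.tensor(-4.6, requires_grad=True)    # exp(-4.6) ~ 0.01
log_wd = torch.tensor(-9.2, requires_grad=True)    # exp(-9.2) ~ 1e-4
logit_mom = torch.tensor(2.2, requires_grad=True)  # sigmoid(2.2) ~ 0.9
meta_optimiser = torch.optim.Adam([log_lr, log_wd, logit_mom], lr=0.05, betas=(0.9, 0.999))

# Stand-in model and synthetic data; the paper uses an MLP, an LSTM and a ResNet-18.
model = torch.nn.Linear(784, 10)
train_loader = [(torch.randn(32, 784), torch.randint(0, 10, (32,))) for _ in range(100)]

inner_optimiser = torch.optim.SGD(model.parameters(),
                                  lr=log_lr.exp().item(),
                                  weight_decay=log_wd.exp().item(),
                                  momentum=torch.sigmoid(logit_mom).item())

def approximate_hypergradient():
    """Hypothetical placeholder for d(validation loss)/d(hyperparameters), which the
    paper estimates by implicit differentiation over the last LOOKBACK weight updates."""
    return torch.zeros(3)

for step, (x, y) in enumerate(train_loader):
    # Ordinary SGD step (weight decay + momentum) under the current hyperparameters.
    inner_optimiser.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    inner_optimiser.step()

    if (step + 1) % T == 0:
        # Every T batches: write the approximate hypergradient into .grad and take an
        # Adam step on the (transformed) hyperparameters.
        meta_optimiser.zero_grad()
        for hp, g in zip((log_lr, log_wd, logit_mom), approximate_hypergradient()):
            hp.grad = g
        meta_optimiser.step()

        # Push the updated hyperparameter values back into the SGD optimiser.
        for group in inner_optimiser.param_groups:
            group["lr"] = log_lr.exp().item()
            group["weight_decay"] = log_wd.exp().item()
            group["momentum"] = torch.sigmoid(logit_mom).item()
```

In this sketch the hyperparameters are optimised in transformed space, so Adam's updates can never push the learning rate or weight decay negative, or the momentum outside (0, 1); this matches the logarithmic and sigmoidal parameterisation the setup refers to (Appendix B.3 of the paper).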