Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions

Authors: Matthew Mackay, Paul Vicol, Jonathan Lorraine, David Duvenaud, Roger Grosse

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our approach outperforms competing hyperparameter optimization methods on large-scale deep learning problems. We call our networks, which update their own hyperparameters online during training, Self-Tuning Networks (STNs). We evaluate the performance of STNs on large-scale deep-learning problems with the Penn Treebank (Marcus et al., 1993) and CIFAR-10 datasets (Krizhevsky & Hinton, 2009), and find that they substantially outperform baseline methods.
Researcher Affiliation | Collaboration | Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, Roger Grosse ({mmackay,pvicol,lorraine,duvenaud,rgrosse}@cs.toronto.edu), University of Toronto, Vector Institute
Pseudocode | Yes | We give our complete algorithm as Algorithm 1 and show how it can be implemented in code in Appendix G. (Algorithm 1: STN Training Algorithm)
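The quoted algorithm alternates between elementary-parameter updates on the training set and hyperparameter updates on the validation set. Below is a minimal PyTorch sketch of such an alternating loop; the object and method names (`model.training_loss`, `hparams.sample`, `hparams.current`) are placeholders rather than the authors' identifiers, and the step counts and learning rates follow the settings quoted in the Experiment Setup row.

```python
# Hypothetical sketch of an STN-style alternating training schedule:
# update the model on the training loss, then the hyperparameters on the
# validation loss. All object/method names here are placeholders.
import itertools
import torch

def train_stn(model, hparams, train_loader, val_loader,
              train_steps=2, val_steps=1, epochs=1):
    opt_w = torch.optim.SGD(model.parameters(), lr=30.0)     # model parameters
    opt_h = torch.optim.Adam(hparams.parameters(), lr=0.01)  # hyperparameters

    train_iter = itertools.cycle(train_loader)
    val_iter = itertools.cycle(val_loader)

    for _ in range(epochs):
        for _ in range(len(train_loader) // train_steps):
            # Inner phase: a few steps on the training set with sampled hyperparameters.
            for _ in range(train_steps):
                x, y = next(train_iter)
                loss = model.training_loss(x, y, hparams.sample())
                opt_w.zero_grad()
                loss.backward()
                opt_w.step()
            # Outer phase: a step on the validation set to update the hyperparameters.
            for _ in range(val_steps):
                x, y = next(val_iter)
                val_loss = model.validation_loss(x, y, hparams.current())
                opt_h.zero_grad()
                val_loss.backward()
                opt_h.step()
```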
Open Source Code | Yes | We give our complete algorithm as Algorithm 1 and show how it can be implemented in code in Appendix G. In this section, we provide PyTorch code listings for the approximate best-response layers used to construct ST-LSTMs and ST-CNNs: the HyperLinear and HyperConv2D classes. We also provide a simplified version of the optimization steps used on the training set and validation set.
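The actual HyperLinear/HyperConv2D listings are in the paper's Appendix G; the snippet below is only an illustrative, simplified layer in the same spirit, where the effective linear map is shifted by a hyperparameter-dependent, per-output-unit scaling. The class name and gating form are assumptions, not the authors' code.

```python
# Illustrative (not the paper's Appendix G code): a linear layer whose
# effective weights respond to the hyperparameter vector through a learned
# per-output-unit scaling. Names and the exact gating form are assumptions.
import torch
import torch.nn as nn

class ApproxBestResponseLinear(nn.Module):
    def __init__(self, in_features, out_features, num_hparams):
        super().__init__()
        self.elem = nn.Linear(in_features, out_features)      # base ("elementary") weights
        self.response = nn.Linear(in_features, out_features)  # response direction
        self.hnet = nn.Linear(num_hparams, out_features)      # hyperparameters -> per-unit scale

    def forward(self, x, hparams):
        scale = self.hnet(hparams)                    # shape: (out_features,)
        return self.elem(x) + scale * self.response(x)

# Example usage with placeholder shapes:
layer = ApproxBestResponseLinear(650, 650, num_hparams=7)
out = layer(torch.randn(32, 650), torch.randn(7))
```

A HyperConv2D analogue would apply the same idea to convolutional kernels instead of a dense weight matrix.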
Open Datasets | Yes | Empirically, we evaluate the performance of STNs on large-scale deep-learning problems with the Penn Treebank (Marcus et al., 1993) and CIFAR-10 datasets (Krizhevsky & Hinton, 2009)
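Both datasets are publicly available; as one way to obtain CIFAR-10, a torchvision download such as the sketch below would suffice (the transform is a generic placeholder, not the paper's preprocessing). Penn Treebank text is distributed separately and is not shown here.

```python
# One way to fetch CIFAR-10 via torchvision; the normalization values are
# generic placeholders and do not reproduce the paper's preprocessing.
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
cifar_train = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=transform)
cifar_test = torchvision.datasets.CIFAR10(
    root="./data", train=False, download=True, transform=transform)
```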
Dataset Splits | Yes | Here, we present additional details on the CNN experiments. For all results, we held out 20% of the training data for validation.
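A minimal sketch of the stated 80/20 holdout, continuing from the CIFAR-10 loading snippet above; the random seed is an assumption, as the paper does not specify how the split was drawn.

```python
# Hold out 20% of the training set for validation (seed is an assumption).
import torch
from torch.utils.data import random_split

val_size = int(0.2 * len(cifar_train))    # cifar_train from the previous sketch
train_size = len(cifar_train) - val_size
train_subset, val_subset = random_split(
    cifar_train, [train_size, val_size],
    generator=torch.Generator().manual_seed(0))
```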
Hardware Specification | No | The paper does not specify any hardware details like CPU, GPU models, or memory used for the experiments.
Software Dependencies | No | The paper mentions using 'PyTorch' but does not specify any version numbers for PyTorch or other software dependencies, which are necessary for full reproducibility.
Experiment Setup | Yes | We used a 2-layer LSTM with 650 hidden units per layer and 650-dimensional word embeddings. We tuned 7 hyperparameters: variational dropout rates for the input, hidden state, and output; embedding dropout... and coefficients α and β... To optimize the baseline LSTM, we used SGD with initial learning rate 30, which was decayed by a factor of 4... We used gradient clipping 0.25. For the hyperparameters, we used Adam with learning rate 0.01. We used an alternating training schedule in which we updated the model parameters for 2 steps on the training set and then updated the hyperparameters for 1 step on the validation set. We used one epoch of warm-up... We terminated training when the learning rate dropped below 0.0003.
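The quoted settings translate into a configuration sketch like the one below. Only the numeric values come from the paper; the LSTM stand-in, the unconstrained hyperparameter tensor, and the helper functions are assumptions for illustration.

```python
# Optimizer/config sketch of the quoted PTB setup. Numeric values are from
# the paper; the modules and helpers below are placeholders.
import torch
import torch.nn as nn

model = nn.LSTM(input_size=650, hidden_size=650, num_layers=2)  # stand-in for the 2-layer LSTM
hparams = nn.Parameter(torch.zeros(7))  # 7 tuned hyperparameters (parametrization assumed)

opt_model = torch.optim.SGD(model.parameters(), lr=30.0)  # initial LR 30, decayed by 4x
opt_hyper = torch.optim.Adam([hparams], lr=0.01)          # Adam for the hyperparameters

CLIP_NORM = 0.25  # gradient clipping for the LSTM
MIN_LR = 3e-4     # terminate training once the LR drops below this

def model_step(loss):
    # One model-parameter update with gradient clipping.
    opt_model.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
    opt_model.step()

def decay_lr_and_continue():
    # Decay the SGD learning rate by a factor of 4; return False once it
    # falls below 0.0003, signalling that training should stop.
    for group in opt_model.param_groups:
        group["lr"] /= 4.0
    return opt_model.param_groups[0]["lr"] >= MIN_LR
```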