Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions
Authors: Matthew MacKay, Paul Vicol, Jonathan Lorraine, David Duvenaud, Roger Grosse
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our approach outperforms competing hyperparameter optimization methods on large-scale deep learning problems. We call our networks, which update their own hyperparameters online during training, Self-Tuning Networks (STNs). We evaluate the performance of STNs on large-scale deep-learning problems with the Penn Treebank (Marcus et al., 1993) and CIFAR-10 datasets (Krizhevsky & Hinton, 2009), and find that they substantially outperform baseline methods. |
| Researcher Affiliation | Collaboration | Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, Roger Grosse {mmackay,pvicol,lorraine,duvenaud,rgrosse}@cs.toronto.edu University of Toronto, Vector Institute |
| Pseudocode | Yes | We give our complete algorithm as Algorithm 1 and show how it can be implemented in code in Appendix G. Algorithm 1 STN Training Algorithm |
| Open Source Code | Yes | We give our complete algorithm as Algorithm 1 and show how it can be implemented in code in Appendix G. In this section, we provide PyTorch code listings for the approximate best-response layers used to construct ST-LSTMs and ST-CNNs: the HyperLinear and HyperConv2D classes. We also provide a simplified version of the optimization steps used on the training set and validation set. (A hedged sketch of such a best-response layer follows the table.) |
| Open Datasets | Yes | Empirically, we evaluate the performance of STNs on large-scale deep-learning problems with the Penn Treebank (Marcus et al., 1993) and CIFAR-10 datasets (Krizhevsky & Hinton, 2009) |
| Dataset Splits | Yes | Here, we present additional details on the CNN experiments. For all results, we held out 20% of the training data for validation. |
| Hardware Specification | No | The paper does not specify any hardware details like CPU, GPU models, or memory used for the experiments. |
| Software Dependencies | No | The paper mentions using 'PyTorch' but does not specify any version numbers for PyTorch or other software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | We used a 2-layer LSTM with 650 hidden units per layer and 650-dimensional word embeddings. We tuned 7 hyperparameters: variational dropout rates for the input, hidden state, and output; embedding dropout... and coefficients α and β... To optimize the baseline LSTM, we used SGD with initial learning rate 30, which was decayed by a factor of 4... We used gradient clipping 0.25. For the hyperparameters, we used Adam with learning rate 0.01. We used an alternating training schedule in which we updated the model parameters for 2 steps on the training set and then updated the hyperparameters for 1 step on the validation set. We used one epoch of warm-up... We terminated training when the learning rate dropped below 0.0003. (A minimal sketch of this alternating schedule also follows the table.) |
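
The HyperLinear listing cited in the Open Source Code row (Appendix G of the paper) is not reproduced here. As a rough illustration of the structured best-response idea, the following is a minimal sketch of a hyperparameter-conditioned linear layer: the output is the sum of an ordinary (elementary) linear map and an auxiliary linear map whose output units are gated by a learned linear function of the hyperparameters. The names (`HyperLinear`, `hscale`, `n_hparams`) and the exact parameterization are our assumptions, not the paper's verbatim code.

```python
import torch
import torch.nn as nn


class HyperLinear(nn.Module):
    """Sketch of a structured best-response linear layer (assumed form).

    The effective weights are an affine function of the hyperparameters:
    each output unit of an auxiliary weight matrix is scaled by a learned
    linear function of the hyperparameter vector, approximating a
    best response W(lambda) without a full hypernetwork.
    """

    def __init__(self, in_features: int, out_features: int, n_hparams: int):
        super().__init__()
        self.elem = nn.Linear(in_features, out_features)               # lambda-independent part
        self.hyper = nn.Linear(in_features, out_features, bias=False)  # lambda-gated part
        self.hscale = nn.Linear(n_hparams, out_features)               # per-unit gate from lambda

    def forward(self, x: torch.Tensor, hparams: torch.Tensor) -> torch.Tensor:
        scale = self.hscale(hparams)  # (out_features,) or (batch, out_features)
        return self.elem(x) + scale * self.hyper(x)


# Usage: a batch of inputs with a single shared hyperparameter vector.
layer = HyperLinear(in_features=10, out_features=5, n_hparams=3)
x = torch.randn(4, 10)
lam = torch.randn(3)
print(layer(x, lam).shape)  # torch.Size([4, 5])
```

The per-unit gating keeps the parameter count at roughly twice an ordinary layer plus O(n_hparams × out_features), instead of the O(n_hparams × in_features × out_features) a fully general hypernetwork would require.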
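
The alternating schedule quoted in the Experiment Setup row can be written as a short loop. The sketch below is our simplification, not the paper's Appendix G code: it tunes a single dropout rate on a toy regression problem, uses a concrete (sigmoid) relaxation so the rate is differentiable, and perturbs the hyperparameter on training steps as STN training does. The toy data, helper names, noise scale, and temperature are all assumptions; only the 2:1 train/validation step ratio, the optimizer choices, and the 0.25 clipping threshold come from the quoted setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x_tr, y_tr = torch.randn(64, 10), torch.randn(64, 1)  # toy "training set"
x_va, y_va = torch.randn(64, 10), torch.randn(64, 1)  # toy "validation set"

model = nn.Linear(10, 1)
h_logit = torch.zeros(1, requires_grad=True)  # unconstrained dropout hyperparameter
w_optim = torch.optim.SGD(model.parameters(), lr=0.1)
h_optim = torch.optim.Adam([h_logit], lr=0.01)
criterion = nn.MSELoss()


def forward(x, logit, perturb=False, temp=0.5):
    """Dropout with a concrete relaxation, so the rate is differentiable."""
    if perturb:  # perturb the hyperparameter on training steps
        logit = logit + 0.5 * torch.randn_like(logit)
    u = torch.rand_like(x).clamp(1e-6, 1 - 1e-6)
    drop = torch.sigmoid((logit + torch.log(u) - torch.log(1 - u)) / temp)
    keep_rate = 1 - torch.sigmoid(logit)
    return model(x * (1 - drop) / keep_rate)  # inverted-dropout scaling


for step in range(100):
    for _ in range(2):  # 2 steps on the training set: update model parameters
        loss = criterion(forward(x_tr, h_logit, perturb=True), y_tr)
        w_optim.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)
        w_optim.step()
    # 1 step on the validation set: update the hyperparameter
    val_loss = criterion(forward(x_va, h_logit), y_va)
    h_optim.zero_grad()  # clears stale grads accumulated during the train steps
    val_loss.backward()
    h_optim.step()
```

In a real STN the model itself would be built from hyperparameter-conditioned layers (e.g. the HyperLinear sketch above), so the training steps fit an approximate best response to the perturbed hyperparameters rather than a single fixed network.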