Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions

Authors: Matthew Mackay, Paul Vicol, Jonathan Lorraine, David Duvenaud, Roger Grosse

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our approach outperforms competing hyperparameter optimization methods on large-scale deep learning problems. We call our networks, which update their own hyperparameters online during training, Self-Tuning Networks (STNs). We evaluate the performance of STNs on large-scale deep-learning problems with the Penn Treebank (Marcus et al., 1993) and CIFAR-10 datasets (Krizhevsky & Hinton, 2009), and find that they substantially outperform baseline methods.
Researcher Affiliation | Collaboration | Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, Roger Grosse ({mmackay,pvicol,lorraine,duvenaud,rgrosse}@cs.toronto.edu), University of Toronto, Vector Institute
Pseudocode | Yes | We give our complete algorithm as Algorithm 1 and show how it can be implemented in code in Appendix G. (Algorithm 1: STN Training Algorithm)
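The quoted algorithm alternates between elementary-parameter updates on the training set and hyperparameter updates on the validation set. Below is a minimal PyTorch sketch of such an alternating loop; the object and method names (`model.training_loss`, `hparams.sample`, `hparams.current`) are placeholders rather than the authors' identifiers, and the step counts and learning rates follow the settings quoted in the Experiment Setup row.

```python
# Hypothetical sketch of an STN-style alternating training schedule:
# update the model on the training loss, then the hyperparameters on the
# validation loss. All object/method names here are placeholders.
import itertools
import torch

def train_stn(model, hparams, train_loader, val_loader,
              train_steps=2, val_steps=1, epochs=1):
    opt_w = torch.optim.SGD(model.parameters(), lr=30.0)     # model parameters
    opt_h = torch.optim.Adam(hparams.parameters(), lr=0.01)  # hyperparameters

    train_iter = itertools.cycle(train_loader)
    val_iter = itertools.cycle(val_loader)

    for _ in range(epochs):
        for _ in range(len(train_loader) // train_steps):
            # Inner phase: a few steps on the training set with sampled hyperparameters.
            for _ in range(train_steps):
                x, y = next(train_iter)
                loss = model.training_loss(x, y, hparams.sample())
                opt_w.zero_grad()
                loss.backward()
                opt_w.step()
            # Outer phase: a step on the validation set to update the hyperparameters.
            for _ in range(val_steps):
                x, y = next(val_iter)
                val_loss = model.validation_loss(x, y, hparams.current())
                opt_h.zero_grad()
                val_loss.backward()
                opt_h.step()
```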
Open Source Code | Yes | We give our complete algorithm as Algorithm 1 and show how it can be implemented in code in Appendix G. In this section, we provide PyTorch code listings for the approximate best-response layers used to construct ST-LSTMs and ST-CNNs: the HyperLinear and HyperConv2D classes. We also provide a simplified version of the optimization steps used on the training set and validation set.
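The actual HyperLinear/HyperConv2D listings are in the paper's Appendix G; the snippet below is only an illustrative, simplified layer in the same spirit, where the effective linear map is shifted by a hyperparameter-dependent, per-output-unit scaling. The class name and gating form are assumptions, not the authors' code.

```python
# Illustrative (not the paper's Appendix G code): a linear layer whose
# effective weights respond to the hyperparameter vector through a learned
# per-output-unit scaling. Names and the exact gating form are assumptions.
import torch
import torch.nn as nn

class ApproxBestResponseLinear(nn.Module):
    def __init__(self, in_features, out_features, num_hparams):
        super().__init__()
        self.elem = nn.Linear(in_features, out_features)      # base ("elementary") weights
        self.response = nn.Linear(in_features, out_features)  # response direction
        self.hnet = nn.Linear(num_hparams, out_features)      # hyperparameters -> per-unit scale

    def forward(self, x, hparams):
        scale = self.hnet(hparams)                    # shape: (out_features,)
        return self.elem(x) + scale * self.response(x)

# Example usage with placeholder shapes:
layer = ApproxBestResponseLinear(650, 650, num_hparams=7)
out = layer(torch.randn(32, 650), torch.randn(7))
```

A HyperConv2D analogue would apply the same idea to convolutional kernels instead of a dense weight matrix.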
Open Datasets | Yes | Empirically, we evaluate the performance of STNs on large-scale deep-learning problems with the Penn Treebank (Marcus et al., 1993) and CIFAR-10 datasets (Krizhevsky & Hinton, 2009)
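Both datasets are publicly available; as one way to obtain CIFAR-10, a torchvision download such as the sketch below would suffice (the transform is a generic placeholder, not the paper's preprocessing). Penn Treebank text is distributed separately and is not shown here.

```python
# One way to fetch CIFAR-10 via torchvision; the normalization values are
# generic placeholders and do not reproduce the paper's preprocessing.
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
cifar_train = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=transform)
cifar_test = torchvision.datasets.CIFAR10(
    root="./data", train=False, download=True, transform=transform)
```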
Dataset Splits | Yes | Here, we present additional details on the CNN experiments. For all results, we held out 20% of the training data for validation.
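A minimal sketch of the stated 80/20 holdout, continuing from the CIFAR-10 loading snippet above; the random seed is an assumption, as the paper does not specify how the split was drawn.

```python
# Hold out 20% of the training set for validation (seed is an assumption).
import torch
from torch.utils.data import random_split

val_size = int(0.2 * len(cifar_train))    # cifar_train from the previous sketch
train_size = len(cifar_train) - val_size
train_subset, val_subset = random_split(
    cifar_train, [train_size, val_size],
    generator=torch.Generator().manual_seed(0))
```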
Hardware Specification | No | The paper does not specify any hardware details like CPU, GPU models, or memory used for the experiments.
Software Dependencies | No | The paper mentions using 'PyTorch' but does not specify any version numbers for PyTorch or other software dependencies, which are necessary for full reproducibility.
Experiment Setup | Yes | We used a 2-layer LSTM with 650 hidden units per layer and 650-dimensional word embeddings. We tuned 7 hyperparameters: variational dropout rates for the input, hidden state, and output; embedding dropout... and coefficients α and β... To optimize the baseline LSTM, we used SGD with initial learning rate 30, which was decayed by a factor of 4... We used gradient clipping 0.25. For the hyperparameters, we used Adam with learning rate 0.01. We used an alternating training schedule in which we updated the model parameters for 2 steps on the training set and then updated the hyperparameters for 1 step on the validation set. We used one epoch of warm-up... We terminated training when the learning rate dropped below 0.0003.
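The quoted settings translate into a configuration sketch like the one below. Only the numeric values come from the paper; the LSTM stand-in, the unconstrained hyperparameter tensor, and the helper functions are assumptions for illustration.

```python
# Optimizer/config sketch of the quoted PTB setup. Numeric values are from
# the paper; the modules and helpers below are placeholders.
import torch
import torch.nn as nn

model = nn.LSTM(input_size=650, hidden_size=650, num_layers=2)  # stand-in for the 2-layer LSTM
hparams = nn.Parameter(torch.zeros(7))  # 7 tuned hyperparameters (parametrization assumed)

opt_model = torch.optim.SGD(model.parameters(), lr=30.0)  # initial LR 30, decayed by 4x
opt_hyper = torch.optim.Adam([hparams], lr=0.01)          # Adam for the hyperparameters

CLIP_NORM = 0.25  # gradient clipping for the LSTM
MIN_LR = 3e-4     # terminate training once the LR drops below this

def model_step(loss):
    # One model-parameter update with gradient clipping.
    opt_model.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
    opt_model.step()

def decay_lr_and_continue():
    # Decay the SGD learning rate by a factor of 4; return False once it
    # falls below 0.0003, signalling that training should stop.
    for group in opt_model.param_groups:
        group["lr"] /= 4.0
    return opt_model.param_groups[0]["lr"] >= MIN_LR
```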