Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters

Authors: Jelena Luketina, Mathias Berglund, Klaus Greff, Tapani Raiko

ICML 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We explore the approach for tuning regularization hyperparameters and find that in experiments on MNIST, SVHN and CIFAR-10, the resulting regularization levels are within the optimal regions. (A minimal sketch of this kind of hyperparameter update appears after the table.)
Researcher Affiliation | Academia | Jelena Luketina¹ (JELENA.LUKETINA@AALTO.FI), Mathias Berglund¹ (MATHIAS.BERGLUND@AALTO.FI), Klaus Greff² (KLAUS@IDSIA.CH), Tapani Raiko¹ (TAPANI.RAIKO@AALTO.FI). ¹Department of Computer Science, Aalto University, Finland; ²IDSIA, Dalle Molle Institute for Artificial Intelligence, USI-SUPSI, Manno-Lugano, Switzerland
Pseudocode | No | The paper describes the proposed method mathematically and textually but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | All the code, as well as the exact configurations used in the experiments, can be found in the project's GitHub repository: https://github.com/jelennal/t1t2
Open Datasets | Yes | We test the method on various configurations of multilayer perceptrons (MLPs) with ReLU activation functions (Dahl et al., 2013) trained on the MNIST (LeCun et al., 1998) and SVHN (Netzer et al., 2011) datasets. We also test the method on two convolutional architectures (CNNs) using CIFAR-10 (Krizhevsky, 2009).
Dataset Splits | Yes | For MNIST we tried various network sizes, from a shallow 1000-1000-1000 network to a deep 4000-2000-1000-500-250 network. Training set T1 had 55,000 samples, and validation set T2 had 5,000 samples. The split between T1 and T2 was made using a different random seed in each of the experiments to avoid overfitting to a particular subset of the training set. ... Out of 73,257 training samples, we picked a random 65,000 samples for T1 and the remaining 8,257 samples for T2. ... To test on CIFAR-10 with convolutional networks, we used 45,000 samples for T1 and 5,000 samples for T2. (A sketch of such a seeded split follows the table.)
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The models were implemented with the Theano package (The Theano Development Team, 2016). However, a specific version number for Theano or other software dependencies is not provided.
Experiment Setup | Yes | For MNIST we tried various network sizes, from a shallow 1000-1000-1000 network to a deep 4000-2000-1000-500-250 network. ... Each of the experiments was run for 200-300 epochs, using batch size 100 for both elementary and hyperparameter training. To speed up elementary parameter training, we use an annealed ADAM learning rate schedule (Kingma and Ba, 2015) with a step size of 10^-3 (MLPs) or 2×10^-3 (CNNs). For tuning noise hyperparameters, we use vanilla gradient descent with a step size of 10^-1, while for L2 hyperparameters, step sizes were significantly smaller, 10^-4. In experiments on larger networks we also use ADAM for tuning hyperparameters, with a step size of 10^-3 for noise and 10^-6 for L2. (A sketch of this optimizer configuration follows the table.)
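
The paper's T1/T2 method trains hyperparameters by differentiating the validation (T2) cost through a single elementary-parameter update computed on the training set (T1). The code below is only a minimal sketch of that update pattern, not the authors' implementation: it assumes PyTorch in place of Theano, a toy linear model with a single L2 hyperparameter, synthetic data, and plain SGD for both the elementary and the hyperparameter step.

import torch

torch.manual_seed(0)
# Synthetic T1 (training) and T2 (validation) data, purely illustrative.
x1, y1 = torch.randn(100, 5), torch.randn(100, 1)
x2, y2 = torch.randn(20, 5), torch.randn(20, 1)

w = torch.randn(5, 1, requires_grad=True)        # elementary parameter
log_lmbda = torch.zeros(1, requires_grad=True)   # hyperparameter (log of L2 weight)

lr, hyper_lr = 1e-2, 1e-2
for step in range(200):
    # (1) Regularized T1 cost; keep the graph so the updated w depends on log_lmbda.
    t1_cost = ((x1 @ w - y1) ** 2).mean() + log_lmbda.exp() * (w ** 2).sum()
    (g_w,) = torch.autograd.grad(t1_cost, w, create_graph=True)
    w_new = w - lr * g_w                         # one elementary SGD step

    # (2) Unregularized T2 cost of the updated parameters, differentiated
    #     w.r.t. the hyperparameter through the update above.
    t2_cost = ((x2 @ w_new - y2) ** 2).mean()
    (g_h,) = torch.autograd.grad(t2_cost, log_lmbda)

    with torch.no_grad():
        log_lmbda -= hyper_lr * g_h              # hyperparameter step
        w.copy_(w_new)                           # commit the elementary step

Because only the most recent elementary update is differentiated, the hyperparameter gradient stays cheap to compute and no training trajectory has to be stored.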
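
The Dataset Splits row quotes per-run random splits into T1 and T2. Below is a small sketch of such a seeded split, assuming NumPy index shuffling; the project repository may split the data differently.

import numpy as np

def split_t1_t2(n_total, n_t2, seed):
    # Shuffle all indices with a per-run seed, then carve off the T2 subset.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    return idx[n_t2:], idx[:n_t2]   # T1 indices, T2 indices

# MNIST-style split: 55,000 samples for T1, 5,000 for T2;
# pass a different seed for each experiment run.
t1_idx, t2_idx = split_t1_t2(60_000, 5_000, seed=123)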
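
The Experiment Setup row lists separate step sizes for the elementary parameters and for the two kinds of hyperparameters. The sketch below mirrors that configuration, assuming PyTorch optimizers as stand-ins for the paper's Theano updates; the model and the two hyperparameter tensors are hypothetical placeholders.

import torch
import torch.nn as nn

# Hypothetical stand-ins: a small MLP and two scalar hyperparameters.
model = nn.Sequential(nn.Linear(784, 1000), nn.ReLU(), nn.Linear(1000, 10))
noise_hp = torch.zeros(1, requires_grad=True)  # e.g. input-noise level
l2_hp = torch.zeros(1, requires_grad=True)     # e.g. L2 penalty weight

elem_opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # elementary step size (MLPs)
noise_opt = torch.optim.SGD([noise_hp], lr=1e-1)          # noise hyperparameter step size
l2_opt = torch.optim.SGD([l2_hp], lr=1e-4)                # L2 hyperparameter step size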