Bilevel Programming for Hyperparameter Optimization and Meta-Learning

Authors: Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, Massimiliano Pontil

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The aim of the following experiments is threefold. First, we investigate the impact of the number of iterations of the optimization dynamics on the quality of the solution on a simple multiclass classification problem. Second, we test our hyper-representation method in the context of few-shot learning on two benchmark datasets. Finally, we contrast the bilevel ML approach against classical approaches to learn shared representations.
Researcher Affiliation | Academia | (1) Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia, Genoa, Italy; (2) Department of Computer Science, University College London, London, UK; (3) Department of Information Engineering, Università degli Studi di Firenze, Florence, Italy.
Pseudocode | Yes | Algorithm 1. Reverse-HG for Hyper-representation (a minimal sketch of the reverse-mode hypergradient idea appears after the table).
Open Source Code | Yes | The code for reproducing the experiments, based on the package FAR-HO (https://bit.ly/far-ho), is available at https://bit.ly/hyper-repr
Open Datasets | Yes | OMNIGLOT (Lake et al., 2015), a dataset that contains examples of 1623 different handwritten characters from 50 alphabets. ... MINIIMAGENET (Vinyals et al., 2016), a subset of ImageNet (Deng et al., 2009), that contains 60000 downsampled images from 100 different classes.
Dataset Splits | Yes | A training set D_tr and a validation set D_val, each consisting of three randomly drawn examples per class, were sampled to form the HO problem. ... each meta-dataset consists of a pool of samples belonging to different (non-overlapping between separate meta-datasets) classes, which can be combined to form ground classification datasets D_j = D_j^tr ∪ D_j^val with 5 or 20 classes (for Omniglot). (An illustrative episode sampler is sketched after the table.)
Hardware Specification | Yes | Table 2. Execution times on a NVidia Tesla M40 GPU.
Software Dependencies | No | The paper mentions using a package 'FAR-HO' but does not specify version numbers for this or any other software dependencies (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | The optimization of H is performed with gradient descent with momentum, with the same initialization, step size and momentum factor for each run. ... We initialize the ground models' parameters w_j to 0 and ... we perform T gradient descent steps, where T is treated as a ML hyperparameter that has to be validated. ... We compute a stochastic approximation of ∇f_T(λ) with Algorithm 1 and use Adam with decaying learning rate to optimize λ. (An outer-loop sketch combining these pieces follows the table.)
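
The Pseudocode row refers to Algorithm 1 (Reverse-HG), whose core operation is reverse-mode differentiation of the validation objective through the T unrolled steps of the inner optimization dynamics. The snippet below is a minimal sketch of that idea, not the authors' FAR-HO/TensorFlow implementation: it assumes a plain gradient-descent inner step (no momentum), and the function names and toy losses are illustrative.

```python
# Minimal reverse-mode hypergradient sketch (the idea behind Reverse-HG):
# unroll T inner gradient steps while keeping them in the autograd graph,
# then back-propagate the validation loss to the hyperparameters.
import torch

def hypergradient(hyper, w0, train_loss, val_loss, T=10, inner_lr=0.1):
    """hyper      : hyperparameter tensor lambda (requires_grad=True)
       w0         : initial inner parameters (the paper initializes them to 0)
       train_loss : callable (w, hyper) -> scalar inner objective
       val_loss   : callable (w, hyper) -> scalar outer objective f_T"""
    w = w0
    for _ in range(T):
        g = torch.autograd.grad(train_loss(w, hyper), w, create_graph=True)[0]
        w = w - inner_lr * g  # inner step, kept differentiable w.r.t. hyper
    # Back-propagating through the unrolled dynamics yields the hypergradient.
    return torch.autograd.grad(val_loss(w, hyper), hyper)[0]

# Toy usage with a quadratic inner problem (purely illustrative):
lam = torch.tensor([1.0], requires_grad=True)
w0 = torch.zeros(3, requires_grad=True)
g = hypergradient(lam, w0,
                  train_loss=lambda w, l: ((w - l) ** 2).sum(),
                  val_loss=lambda w, l: ((w - 2.0) ** 2).sum())
print(g)
```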
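
The Dataset Splits row describes episodes D_j = D_j^tr ∪ D_j^val drawn from pools of non-overlapping classes. A hypothetical sampler along those lines could look like the following; the pool structure, shot counts, and function name are assumptions rather than details taken from the released code.

```python
# Hypothetical N-way episode sampler: draw N classes from a class pool and
# split their examples into D_j^tr and D_j^val. 5-way shown here; the paper
# also reports 20-way on Omniglot.
import random

def sample_episode(pool, n_way=5, k_train=1, k_val=15, rng=random):
    """pool: dict mapping class id -> list of examples (assumed structure)."""
    classes = rng.sample(sorted(pool), n_way)
    d_tr, d_val = [], []
    for label, cls in enumerate(classes):
        examples = rng.sample(pool[cls], k_train + k_val)
        d_tr += [(x, label) for x in examples[:k_train]]
        d_val += [(x, label) for x in examples[k_train:]]
    return d_tr, d_val
```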
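
The Experiment Setup row describes computing a stochastic approximation of the hypergradient over mini-batches of episodes and updating λ with Adam under a decaying learning rate. The sketch below ties the two previous snippets together; the batch size, number of steps, learning rate, and decay schedule are placeholders, not the paper's settings.

```python
# Sketch of the outer (meta) loop: average the hypergradient over a small
# batch of sampled episodes, then update lambda with Adam and a decaying
# learning rate. All numeric settings here are illustrative.
import torch

def outer_loop(hyper, make_w0, inner_loss, outer_loss, pool,
               steps=1000, meta_batch_size=4, T=5):
    # make_w0: factory returning fresh inner parameters (e.g. zeros with
    # requires_grad=True), mirroring the paper's w_j = 0 initialization.
    opt = torch.optim.Adam([hyper], lr=1e-3)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.999)
    for _ in range(steps):
        grad = torch.zeros_like(hyper)
        for _ in range(meta_batch_size):
            d_tr, d_val = sample_episode(pool)
            grad += hypergradient(
                hyper, make_w0(),
                train_loss=lambda w, l: inner_loss(w, l, d_tr),
                val_loss=lambda w, l: outer_loss(w, l, d_val),
                T=T)
        opt.zero_grad()
        hyper.grad = grad / meta_batch_size  # stochastic hypergradient estimate
        opt.step()
        sched.step()
    return hyper
```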