Online Hyperparameter Meta-Learning with Hypergradient Distillation

Authors: Hae Beom Lee, Hayeon Lee, JaeWoong Shin, Eunho Yang, Timothy Hospedales, Sung Ju Hwang

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our method on two different meta-learning methods and three benchmark datasets.
Researcher Affiliation | Collaboration | KAIST, AITRICS, and Lunit (South Korea); University of Edinburgh and Samsung AI Centre, Cambridge (United Kingdom)
Pseudocode | Yes | Algorithm 1: Reverse-HG (RMD)... Algorithm 2: DrMAD (Fu et al., 2016)... Algorithm 3: HyperDistill... Algorithm 4: LinearEstimation(γ, λ, φ). (A minimal reverse-mode hypergradient sketch follows the table.)
Open Source Code | Yes | Code is publicly available at: https://github.com/haebeom-lee/hyperdistill
Open Datasets | Yes | 1) Tiny ImageNet (Le & Yang, 2015): This dataset contains 200 classes of general categories... 2) CIFAR-100 (Krizhevsky et al., 2009): This dataset contains 100 classes of general categories.
Dataset Splits | Yes | 1) Tiny ImageNet (Le & Yang, 2015)... We split them into 100, 40, and 60 classes for meta-training, meta-validation, and meta-test. 2) CIFAR-100 (Krizhevsky et al., 2009)... We split them into 50, 20, and 30 classes for meta-training, meta-validation, and meta-test.
Hardware Specification | Yes | We used RTX 2080 Ti for the measurements.
Software Dependencies | No | The paper mentions optimizers such as SGD and Adam (Kingma & Ba, 2015) but does not specify versions for core software libraries or frameworks (e.g., PyTorch, TensorFlow, Python).
Experiment Setup | Yes | Meta-training: For inner-optimization of the weights, we use SGD with momentum 0.9 and set the learning rate μ_inner = 0.1 for Meta-Weight-Net and μ_inner = 0.01 for the others. The number of inner steps is T = 100 and the batch size is 100. We use random cropping and horizontal flipping as data augmentations. For the hyperparameter optimization, we also use SGD with momentum 0.9, with learning rate μ_hyper = 0.01 for Meta-Weight-Net and μ_hyper = 0.001 for the others, which we linearly decay toward 0 over a total of M = 1000 inner-optimizations. We perform parallel meta-learning with the meta-batch size set to 4. (A hedged configuration sketch follows the table.)
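
The Pseudocode row lists a reverse-mode hypergradient routine (Reverse-HG / RMD). The following is a minimal, self-contained sketch of that general technique in PyTorch: unroll T differentiable inner SGD steps, then backpropagate a validation loss to a hyperparameter. The toy losses, the ridge-style hyperparameter `lam`, and all shapes are illustrative assumptions, not the authors' implementation.

```python
# Sketch of reverse-mode (unrolled) hypergradient computation.
# Everything below is a toy assumption; only the overall RMD pattern
# (differentiable inner loop + backprop to the hyperparameter) is the point.
import torch

def inner_loss(w, lam, x, y):
    # Toy training loss: squared error plus an L2 penalty whose strength
    # is the hyperparameter lam (the quantity we differentiate w.r.t.).
    pred = x @ w
    return ((pred - y) ** 2).mean() + lam * (w ** 2).sum()

def val_loss(w, x, y):
    # Toy validation loss on held-out data.
    pred = x @ w
    return ((pred - y) ** 2).mean()

def hypergradient(lam, w0, train, val, T=10, lr=0.1):
    """Unroll T differentiable SGD steps on the inner loss, then
    backpropagate the validation loss to the hyperparameter lam."""
    x_tr, y_tr = train
    x_va, y_va = val
    w = w0.clone()
    for _ in range(T):
        g = torch.autograd.grad(inner_loss(w, lam, x_tr, y_tr), w,
                                create_graph=True)[0]
        w = w - lr * g  # differentiable update: keeps the unrolled graph
    return torch.autograd.grad(val_loss(w, x_va, y_va), lam)[0]

if __name__ == "__main__":
    torch.manual_seed(0)
    d = 5
    lam = torch.tensor(0.1, requires_grad=True)
    w0 = torch.zeros(d, requires_grad=True)
    train = (torch.randn(100, d), torch.randn(100))
    val = (torch.randn(50, d), torch.randn(50))
    print("d L_val / d lambda:", hypergradient(lam, w0, train, val).item())
```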
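
The Experiment Setup row translates naturally into optimizer configuration. Below is a hedged sketch assuming PyTorch; the placeholder `model`, `hyperparams`, and the empty loop bodies are assumptions, while the momentum, learning rates, T, M, batch sizes, and the linear decay of the hyperparameter learning rate come from the row above (0.01 / 0.001 is the "others" setting; Meta-Weight-Net uses 0.1 / 0.01).

```python
# Hedged configuration sketch of the reported optimizer setup.
import torch
from torch.optim.lr_scheduler import LambdaLR

T = 100          # inner steps per inner-optimization
M = 1000         # total number of inner-optimizations
BATCH_SIZE = 100
META_BATCH_SIZE = 4

model = torch.nn.Linear(32 * 32 * 3, 100)           # placeholder backbone
hyperparams = torch.nn.Parameter(torch.zeros(10))   # placeholder hyperparameters

# Inner weights: SGD with momentum 0.9, mu_inner = 0.01 ("others" setting).
inner_opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Hyperparameters: SGD with momentum 0.9, mu_hyper = 0.001 ("others" setting).
hyper_opt = torch.optim.SGD([hyperparams], lr=0.001, momentum=0.9)

# Linearly decay the hyperparameter learning rate toward 0 over M episodes.
hyper_sched = LambdaLR(hyper_opt, lr_lambda=lambda m: max(0.0, 1.0 - m / M))

for episode in range(M):
    for step in range(T):
        # ... sample a batch of size BATCH_SIZE with random cropping and
        # horizontal flipping, compute the hyperparameter-dependent loss,
        # then: inner_opt.zero_grad(); loss.backward(); inner_opt.step()
        pass
    # ... compute/approximate the hypergradient, write it into
    # hyperparams.grad, then: hyper_opt.step(); hyper_opt.zero_grad()
    hyper_sched.step()
```

The sketch keeps only the reported quantities; how the hypergradient is computed (e.g., via distillation as in the paper) is deliberately left as a comment.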