Online Hyperparameter Meta-Learning with Hypergradient Distillation
Authors: Hae Beom Lee, Hayeon Lee, JaeWoong Shin, Eunho Yang, Timothy Hospedales, Sung Ju Hwang
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our method on two different meta-learning methods and three benchmark datasets. |
| Researcher Affiliation | Collaboration | KAIST, AITRICS, Lunit (South Korea); University of Edinburgh, Samsung AI Centre Cambridge (United Kingdom) |
| Pseudocode | Yes | Algorithm 1 Reverse-HG (RMD)... Algorithm 2 DrMAD (Fu et al., 2016)... Algorithm 3 HyperDistill... Algorithm 4 LinearEstimation(γ, λ, φ) *(a sketch of the reverse-mode hypergradient appears after the table)* |
| Open Source Code | Yes | Code is publicly available at: https://github.com/haebeom-lee/hyperdistill |
| Open Datasets | Yes | 1) Tiny ImageNet (Le & Yang, 2015): this dataset contains 200 classes of general categories... 2) CIFAR100 (Krizhevsky et al., 2009): this dataset contains 100 classes of general categories. |
| Dataset Splits | Yes | 1) Tiny ImageNet (Le & Yang, 2015)... We split them into 100, 40, and 60 classes for meta-training, meta-validation, and meta-test. 2) CIFAR100 (Krizhevsky et al., 2009)... We split them into 50, 20, and 30 classes for meta-training, meta-validation, and meta-test. |
| Hardware Specification | Yes | We used an RTX 2080 Ti GPU for the measurements. |
| Software Dependencies | No | The paper mentions optimizers like SGD and Adam (Kingma & Ba, 2015) but does not specify versions for core software libraries or frameworks (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | Meta-training: for inner-optimization of the weights, we use SGD with momentum 0.9 and set the learning rate µ_inner = 0.1 for Meta-Weight-Net and µ_inner = 0.01 for the others. The number of inner steps is T = 100 and the batch size is 100. We use random cropping and horizontal flipping as data augmentations. For the hyperparameter optimization, we also use SGD with momentum 0.9 with learning rate µ_hyper = 0.01 for Meta-Weight-Net and µ_hyper = 0.001 for the others, which we linearly decay toward 0 over the total M = 1000 inner-optimizations. We perform parallel meta-learning with the meta-batch size set to 4. *(a configuration sketch follows the table)* |
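
The Pseudocode row lists Reverse-HG (RMD), i.e. reverse-mode differentiation of the validation loss through the inner optimization. Below is a minimal sketch of the quantity it computes, assuming PyTorch (the Software Dependencies row notes that the framework is not specified) and a toy least-squares inner problem with an illustrative L2-style hyperparameter `lam`; plain SGD is used here instead of the momentum-SGD quoted in the Experiment Setup row, and none of the tensors correspond to the paper's actual setup.

```python
import torch

torch.manual_seed(0)
x_tr, y_tr = torch.randn(32, 5), torch.randn(32, 1)   # toy "training" split
x_va, y_va = torch.randn(32, 5), torch.randn(32, 1)   # toy "validation" split

lam = torch.tensor(0.1, requires_grad=True)   # hyperparameter (illustrative L2 coefficient)
w = torch.zeros(5, 1, requires_grad=True)     # initial inner weights
mu_inner, T = 0.01, 100                       # inner learning rate and steps, as in the table

for _ in range(T):
    train_loss = ((x_tr @ w - y_tr) ** 2).mean() + lam * (w ** 2).sum()
    # create_graph=True keeps each SGD step differentiable w.r.t. lam
    (g,) = torch.autograd.grad(train_loss, w, create_graph=True)
    w = w - mu_inner * g                      # one differentiable inner SGD step

val_loss = ((x_va @ w - y_va) ** 2).mean()
hypergrad, = torch.autograd.grad(val_loss, lam)   # reverse-mode hypergradient dL_val(w_T)/dlam
print(hypergrad.item())
```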
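
For the Experiment Setup row, the following is a minimal configuration sketch, again assuming PyTorch and using placeholder `model` and `hyperparams` objects: SGD with momentum 0.9 at both levels, with the hyperparameter learning rate decayed linearly toward 0 over M = 1000 inner-optimizations.

```python
import torch

model = torch.nn.Linear(10, 2)                      # placeholder for the inner network
hyperparams = [torch.zeros(1, requires_grad=True)]  # placeholder hyperparameters

inner_opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)   # mu_inner = 0.01
hyper_opt = torch.optim.SGD(hyperparams, lr=0.001, momentum=0.9)         # mu_hyper = 0.001

M = 1000  # total number of inner-optimizations
# Decay the hyper learning rate linearly toward 0 over the M hyper-updates
hyper_sched = torch.optim.lr_scheduler.LambdaLR(
    hyper_opt, lr_lambda=lambda step: max(0.0, 1.0 - step / M))

for step in range(M):
    # ... run T = 100 inner steps with inner_opt, compute the hypergradient
    #     for the hyperparameters, then take one hyper step ...
    hyper_opt.step()
    hyper_sched.step()
```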