A Simple Guard for Learned Optimizers
Authors: Isabeau Prémont-Schwarz, Jaroslav Vítků, Jan Feyereisl
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Experiments, This chapter compares the proposed LGL2O with the original GL2O, non-guarded L2O and baseline hand-crafted algorithms. Then we follow with out-of-distribution experiments. First with only the dataset being out of distribution (other than MNIST), then with only the optimizee being out of distribution (ConvNets instead of MLPs), and finally both. We prove that our guard keeps the convergence guarantee of the designed optimizer. We show theoretical proof of LGL2O's convergence guarantee. |
| Researcher Affiliation | Industry | GoodAI, Prague, Czechia. Correspondence to: Isabeau Prémont-Schwarz <premont-schwarz@goodai.com>, Jaroslav Vítků <jaroslav.vitku@goodai.com>. |
| Pseudocode | Yes | Algorithm 1: Loss-Guarded L2O with (deterministic) gradient descent, Algorithm 2: Loss-Guarded L2O with stochastic gradient descent (a hedged sketch of such a guarded step appears after the table) |
| Open Source Code | No | The paper does not provide any specific links to open-source code or explicit statements about code availability for the described methodology. |
| Open Datasets | Yes | The experiments were conducted on publicly available datasets, namely MNIST (LeCun & Cortes, 2010), Fashion MNIST (Xiao et al., 2017), CIFAR10 (Krizhevsky, 2009), Tiny ImageNet, a subset of the ImageNet dataset (Russakovsky et al., 2015), and simple datasets from the Scikit-learn library (Pedregosa et al., 2011). The Tiny ImageNet dataset is publicly available from the Kaggle competition website at https://www.kaggle.com/c/tinyimagenet/data |
| Dataset Splits | Yes | Sample n_t train mini-batches B_t = [b_1, ..., b_{n_t}]; sample n_c validation mini-batches B_v = [v_1, ..., v_{n_c}]. In all our experiments, both n_t and n_c are chosen to be 10. |
| Hardware Specification | Yes | Every run was executed on a single NVIDIA GPU with between 4 GB and 12 GB of memory. |
| Software Dependencies | Yes | All experiments were coded in Python 3.9 with PyTorch 1.8.1 on CUDA 11.0. The Scikit-learn datasets were loaded from Scikit-learn version 0.24.0. |
| Experiment Setup | Yes | In all our experiments, both n_t and n_c are chosen to be 10. The learned optimizer (and its weights) is identical in all experiments and consists of an LSTM (Hochreiter & Schmidhuber, 1997) with 2 hidden layers of 20 cells each and a linear output layer, which was meta-trained with a rollout length of 100 steps to optimize an MLP on the MNIST dataset. The best learning rates for Adam were found for each combination of optimizee (MLP or ConvNet) and dataset from the set [0.0001, 0.001, 0.01, 0.1] over 300 optimization steps. In the case of SGD, the learning rate was set based on practical experience to 3.0 and the (optional) momentum to 0.9. (A hedged PyTorch sketch of this optimizer architecture follows the table.) |
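
The pseudocode row names Algorithm 1 (Loss-Guarded L2O with deterministic gradient descent) but does not reproduce it, so the following is only a minimal Python sketch of what such a guarded step could look like, assuming the guard applies both the learned-optimizer update and the hand-crafted fallback update and keeps whichever scores lower on the sampled validation mini-batches; `learned_opt_update`, `sgd_update`, and `loss_fn` are hypothetical callables, not the paper's API.

```python
import copy
import torch

def loss_guarded_step(model, learned_opt_update, sgd_update, val_batches, loss_fn):
    """Hypothetical sketch of one loss-guarded step: apply both candidate
    updates to copies of the model, score each on the n_c sampled validation
    mini-batches (n_c = 10 in the paper), and keep the better candidate."""

    def mean_val_loss(candidate):
        # Average loss over the sampled validation mini-batches B_v.
        with torch.no_grad():
            losses = [loss_fn(candidate(x), y) for x, y in val_batches]
        return torch.stack(losses).mean()

    l2o_model = copy.deepcopy(model)   # candidate from the learned (L2O) optimizer
    learned_opt_update(l2o_model)

    sgd_model = copy.deepcopy(model)   # candidate from the hand-crafted fallback
    sgd_update(sgd_model)

    # Guard: keep whichever candidate achieves the lower mean validation loss.
    if mean_val_loss(l2o_model) <= mean_val_loss(sgd_model):
        return l2o_model
    return sgd_model
```

In this sketch the guard can only ever replace the learned update with the fallback update, which is the intuition behind the quoted claim that the guard preserves the hand-crafted optimizer's convergence guarantee.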
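
The experiment-setup row fixes the learned optimizer's architecture (an LSTM with 2 hidden layers of 20 cells each plus a linear output layer). A hedged PyTorch sketch of such a module follows; the per-parameter input features and the interpretation of the output are assumptions, since the table does not state them.

```python
import torch
import torch.nn as nn

class L2OOptimizerNet(nn.Module):
    """Sketch of the learned optimizer described in the setup row: an LSTM
    with 2 hidden layers of 20 cells each, followed by a linear output layer.
    The input features per parameter (e.g. the gradient) are an assumption."""

    def __init__(self, input_size=1, hidden_size=20, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers)
        self.head = nn.Linear(hidden_size, 1)  # one proposed update per parameter

    def forward(self, grad_features, hidden=None):
        # grad_features: (seq_len, n_params, input_size) per-parameter inputs.
        out, hidden = self.lstm(grad_features, hidden)
        return self.head(out), hidden


# Minimal smoke test with a fabricated input (shapes are illustrative only).
net = L2OOptimizerNet()
updates, state = net(torch.randn(1, 128, 1))
print(updates.shape)  # torch.Size([1, 128, 1])
```

With `input_size=1` the network sees a single scalar feature per parameter; the actual feature set and preprocessing used in the paper are not recoverable from the table.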