Learned Optimizers that Scale and Generalize
Authors: Olga Wichrowska, Niru Maheswaranathan, Matthew W. Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Nando de Freitas, Jascha Sohl-Dickstein
ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce a learned gradient descent optimizer that generalizes well to new tasks, and which has significantly reduced memory and computation overhead. We achieve this by introducing a novel hierarchical RNN architecture, with minimal per-parameter overhead, augmented with additional architectural features that mirror the known structure of optimization tasks. We also develop a meta-training ensemble of small, diverse, optimization tasks capturing common properties of loss landscapes. The optimizer learns to outperform RMSProp/ADAM on problems in this corpus. More importantly, it performs comparably or better when applied to small convolutional neural networks, despite seeing no neural networks in its meta-training set. Finally, it generalizes to train Inception V3 and ResNet V2 architectures on the ImageNet dataset for thousands of steps, optimization problems that are of a vastly different scale than those it was trained on. |
| Researcher Affiliation | Collaboration | ¹Google Brain, ²Work done during an internship at Google Brain, ³Stanford University, ⁴DeepMind. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The code defining each of these problems will be open sourced shortly. |
| Open Datasets | Yes | Figure 2. Training loss versus number of optimization steps on MNIST for the Learned optimizer... Figure 4. The learned optimizer generalizes to new problem types unlike any in the meta-training set, and with many more parameters. ... On Inception V3 and ResNet V2 architectures on the ImageNet dataset |
| Dataset Splits | No | The paper does not provide specific details on validation dataset splits. |
| Hardware Specification | Yes | Figure 7. Wall clock time in seconds to run a single gradient and update step for a 6-layer ConvNet architecture on an HP z440 workstation with an NVIDIA Titan X GPU. |
| Software Dependencies | No | We used a GRU architecture (Cho et al., 2014) for all three of the RNN levels. |
| Experiment Setup | Yes | The architecture used in the experimental results has a Parameter RNN hidden state size of 10, and a Tensor and Global RNN state size of 20 (the architecture used by Andrychowicz et al. (2016) had a two layer RNN for each parameter, with 20 units per layer). ... We used a GRU architecture (Cho et al., 2014) for all three of the RNN levels. ... the learning rate is initialized from a log uniform distribution from 10⁻⁶ to 10⁻². ... The optimizers were meta-trained for at least 40M meta-iterations ... The meta-objective was minimized with asynchronous RMSProp across 1000 workers, with a learning rate of 10⁻⁶. |
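
The Experiment Setup row above quotes the hierarchical GRU sizes (Parameter RNN: 10 hidden units, Tensor and Global RNNs: 20 each) and a log-uniform learning-rate initialization between 10⁻⁶ and 10⁻². The sketch below is a minimal, hedged illustration of that three-level hierarchy, not the authors' implementation: the input feature size, the readout head, and the way states are aggregated across levels are assumptions made purely for illustration.

```python
# Minimal sketch (assumptions, not the authors' code) of a three-level
# hierarchical optimizer RNN with the hidden sizes quoted above:
# Parameter RNN = 10, Tensor RNN = 20, Global RNN = 20, all GRU cells.
import math
import torch
import torch.nn as nn


def log_uniform_learning_rates(n, low=1e-6, high=1e-2):
    """Draw n initial learning rates log-uniformly in [low, high],
    mirroring the quoted initialization range (10^-6 to 10^-2)."""
    return torch.exp(torch.empty(n).uniform_(math.log(low), math.log(high)))


class HierarchicalOptimizerRNN(nn.Module):
    """Per-parameter, per-tensor, and global GRUs; the feature count and
    readout below are illustrative assumptions."""

    def __init__(self, n_features=8, param_hidden=10,
                 tensor_hidden=20, global_hidden=20):
        super().__init__()
        # Per-parameter RNN: smallest state, one per scalar parameter.
        self.param_rnn = nn.GRUCell(n_features + tensor_hidden, param_hidden)
        # Per-tensor RNN: fed the averaged parameter states plus global state.
        self.tensor_rnn = nn.GRUCell(param_hidden + global_hidden, tensor_hidden)
        # Global RNN: a single state shared across the whole problem.
        self.global_rnn = nn.GRUCell(tensor_hidden, global_hidden)
        # Readout: an update direction and a log step-size change per parameter.
        self.readout = nn.Linear(param_hidden, 2)

    def init_states(self, n_params):
        return (torch.zeros(n_params, self.param_rnn.hidden_size),
                torch.zeros(1, self.tensor_rnn.hidden_size),
                torch.zeros(1, self.global_rnn.hidden_size))

    def forward(self, grad_features, h_param, h_tensor, h_global):
        # grad_features: [n_params, n_features] per-parameter inputs
        # (e.g. rescaled gradients); the exact feature set is an assumption.
        n_params = grad_features.shape[0]
        h_param = self.param_rnn(
            torch.cat([grad_features, h_tensor.expand(n_params, -1)], dim=1),
            h_param)
        h_tensor = self.tensor_rnn(
            torch.cat([h_param.mean(dim=0, keepdim=True), h_global], dim=1),
            h_tensor)
        h_global = self.global_rnn(h_tensor, h_global)
        update, log_lr_delta = self.readout(h_param).unbind(dim=1)
        return update, log_lr_delta, h_param, h_tensor, h_global
```

For simplicity this sketch models a single weight tensor; in the paper each weight tensor of the optimizee has its own Tensor RNN state, and the Global RNN aggregates across all of them.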