Learned Optimizers that Scale and Generalize

Authors: Olga Wichrowska, Niru Maheswaranathan, Matthew W. Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Nando de Freitas, Jascha Sohl-Dickstein

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce a learned gradient descent optimizer that generalizes well to new tasks, and which has significantly reduced memory and computation overhead. We achieve this by introducing a novel hierarchical RNN architecture, with minimal per-parameter overhead, augmented with additional architectural features that mirror the known structure of optimization tasks. We also develop a meta-training ensemble of small, diverse optimization tasks capturing common properties of loss landscapes. The optimizer learns to outperform RMSProp/ADAM on problems in this corpus. More importantly, it performs comparably or better when applied to small convolutional neural networks, despite seeing no neural networks in its meta-training set. Finally, it generalizes to train Inception V3 and ResNet V2 architectures on the ImageNet dataset for thousands of steps, optimization problems that are of a vastly different scale than those it was trained on. (A hedged sketch of the hierarchical RNN appears after this table.)
Researcher Affiliation | Collaboration | 1 Google Brain; 2 Work done during an internship at Google Brain; 3 Stanford University; 4 DeepMind.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The code defining each of these problems will be open sourced shortly.
Open Datasets | Yes | Figure 2. Training loss versus number of optimization steps on MNIST for the Learned optimizer... Figure 4. The learned optimizer generalizes to new problem types unlike any in the meta-training set, and with many more parameters. ... On Inception V3 and ResNet V2 architectures on the ImageNet dataset
Dataset Splits | No | The paper does not provide specific details on validation dataset splits.
Hardware Specification | Yes | Figure 7. Wall clock time in seconds to run a single gradient and update step for a 6-layer ConvNet architecture on an HP z440 workstation with an NVIDIA Titan X GPU.
Software Dependencies | No | We used a GRU architecture (Cho et al., 2014) for all three of the RNN levels.
Experiment Setup | Yes | The architecture used in the experimental results has a Parameter RNN hidden state size of 10, and a Tensor and Global RNN state size of 20 (the architecture used by Andrychowicz et al. (2016) had a two-layer RNN for each parameter, with 20 units per layer). ... We used a GRU architecture (Cho et al., 2014) for all three of the RNN levels. ... the learning rate is initialized from a log-uniform distribution from 10^-6 to 10^-2. ... The optimizers were meta-trained for at least 40M meta-iterations ... The meta-objective was minimized with asynchronous RMSProp across 1000 workers, with a learning rate of 10^-6. (Illustrative sketches of the RNN hierarchy and the log-uniform initialization follow this table.)
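
The hierarchical architecture quoted above lends itself to a short illustration. Below is a minimal sketch, assuming PyTorch, of a three-level GRU hierarchy with the reported hidden-state sizes (10 for the Parameter RNN, 20 for the Tensor and Global RNNs). The class name HierarchicalOptimizerRNN, the single-feature input (one scaled gradient per parameter), and the exact way higher-level states are broadcast back down are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn


class HierarchicalOptimizerRNN(nn.Module):
    """Sketch of a three-level GRU hierarchy: Parameter -> Tensor -> Global."""

    def __init__(self, param_size=10, tensor_size=20, global_size=20, in_features=1):
        super().__init__()
        # Per-parameter GRU with a small hidden state to keep per-parameter overhead low.
        self.param_rnn = nn.GRUCell(in_features + tensor_size, param_size)
        # One GRU state per weight tensor, fed the mean Parameter-RNN state for that tensor.
        self.tensor_rnn = nn.GRUCell(param_size + global_size, tensor_size)
        # A single global GRU, fed the mean Tensor-RNN state across all tensors.
        self.global_rnn = nn.GRUCell(tensor_size, global_size)
        # Linear readout of a parameter update from the per-parameter state.
        self.update_head = nn.Linear(param_size, 1)

    def forward(self, grads, h_param, h_tensor, h_global):
        # grads: list of flattened gradient vectors, one entry per weight tensor.
        # h_param[i]: (n_i, param_size), h_tensor[i]: (1, tensor_size), h_global: (1, global_size).
        new_h_param, updates = [], []
        for g, hp, ht in zip(grads, h_param, h_tensor):
            # Broadcast the tensor-level state to every parameter in this tensor.
            inp = torch.cat([g.unsqueeze(-1), ht.expand(g.numel(), -1)], dim=-1)
            hp = self.param_rnn(inp, hp)
            new_h_param.append(hp)
            updates.append(self.update_head(hp).squeeze(-1))

        new_h_tensor = []
        for hp, ht in zip(new_h_param, h_tensor):
            # Each Tensor RNN sees its parameters' mean state plus the global state.
            inp = torch.cat([hp.mean(dim=0, keepdim=True), h_global], dim=-1)
            new_h_tensor.append(self.tensor_rnn(inp, ht))

        # The Global RNN sees the mean Tensor-RNN state over all weight tensors.
        new_h_global = self.global_rnn(torch.stack(new_h_tensor).mean(dim=0), h_global)
        return updates, new_h_param, new_h_tensor, new_h_global

Keeping the per-parameter hidden state at only 10 units is what the paper credits for the low memory overhead relative to the per-parameter two-layer, 20-unit RNN of Andrychowicz et al. (2016); the tensor- and global-level states are shared, so their cost does not grow with the number of parameters.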
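The log-uniform learning-rate initialization quoted in the setup row can be made concrete in a couple of lines. This is a generic numpy sketch of a log-uniform draw over the reported range; the function name and signature are assumptions.

import numpy as np

def sample_initial_learning_rate(low=1e-6, high=1e-2, rng=None):
    # Log-uniform draw: sample the exponent uniformly, then exponentiate,
    # so every decade between `low` and `high` is equally likely.
    rng = rng or np.random.default_rng()
    return float(10.0 ** rng.uniform(np.log10(low), np.log10(high)))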