Reverse engineering learned optimizers reveals known and novel mechanisms

Authors: Niru Maheswaranathan, David Sussillo, Luke Metz, Ruoxi Sun, Jascha Sohl-Dickstein

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study learned optimizers trained from scratch on four disparate tasks, and discover that they have learned interpretable behavior, including: momentum, gradient clipping, learning rate schedules, and learning rate adaptation. Moreover, we show how dynamics and mechanisms inside of learned optimizers orchestrate these computations. Our results help elucidate the previously murky understanding of how learned optimizers work, and establish tools for interpreting future learned optimizers. Figure 1: Learned optimizers outperform well tuned baselines on four tasks: (a) linear regression, (b) the Rosenbrock function, (c) training a fully connected neural network on the two moons dataset, and (d) training a convolutional neural network on the MNIST dataset. (A hand-written sketch of these mechanisms appears after this table.)
Researcher Affiliation | Industry | Niru Maheswaranathan (Google Research, Brain Team; niru@hey.com); David Sussillo (Google Research, Brain Team); Luke Metz (Google Research, Brain Team; lmetz@google.com); Ruoxi Sun (Google Research, Brain Team; ruoxis@google.com); Jascha Sohl-Dickstein (Google Research, Brain Team; jaschasd@google.com)
Pseudocode | No | The paper describes conceptual mechanisms and analyses but does not include any structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | We provide code for training and analyzing learned optimizers, as well as the trained weights for the learned optimizers studied here, at https://bit.ly/3eqgNrH.
Open Datasets | Yes | MNIST: The fourth task is to train a four layer convolutional network to classify digits from the MNIST dataset. (A sketch of such a network appears after this table.)
Dataset Splits | No | No explicit statements were found regarding training, validation, or test split percentages, absolute sample counts, or citations to predefined splits. The paper describes hyperparameter tuning ('We selected the hyperparameters for each problem out of 2500 samples randomly drawn from a grid'), which implies some validation procedure, but it gives no split details. (A sketch of this kind of random grid search appears after this table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies | No | The paper mentions general software libraries used for analysis, such as NumPy, SciPy, Matplotlib, and JAX (via citations), but does not provide version numbers for these dependencies in the main text or a dedicated setup section.
Experiment Setup | No | The paper refers to appendices for specific experimental setup details, stating 'See App. D.2 for details about the optimizer architecture and meta-training procedures' and 'Details about the exact grid ranges used for each task are in App. D.3.' The main text itself does not contain concrete hyperparameter values or detailed training configurations.
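
The Research Type row lists momentum, gradient clipping, learning rate schedules, and learning rate adaptation as the mechanisms found inside the learned optimizers. The snippet below is a minimal hand-written sketch of three of those classical mechanisms (momentum, clipping, a decaying schedule; RMSProp-style adaptation is omitted for brevity) applied to the Rosenbrock function, one of the paper's four tasks. It is not the paper's learned optimizer, and every constant in it is an illustrative assumption.

```python
import jax
import jax.numpy as jnp

def rosenbrock(p):
    # Standard 2-D Rosenbrock function (a = 1, b = 100), one of the paper's four tasks.
    x, y = p
    return (1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2

grad_fn = jax.jit(jax.grad(rosenbrock))

def update(p, velocity, step, base_lr=1e-3, momentum=0.9, clip=1.0, decay=1e-3):
    """One hand-written step: clip the gradient, decay the learning rate,
    and apply heavy-ball momentum. All constants are illustrative."""
    g = grad_fn(p)
    norm = jnp.linalg.norm(g)
    g = jnp.where(norm > clip, g * (clip / norm), g)   # gradient clipping
    lr = base_lr / (1.0 + decay * step)                # learning rate schedule
    velocity = momentum * velocity - lr * g            # momentum
    return p + velocity, velocity

p = jnp.array([-1.5, 1.5])
v = jnp.zeros_like(p)
for step in range(2000):
    p, v = update(p, v, step)
print(p)  # slowly drifts toward the minimum at (1.0, 1.0)
```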
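
The Open Datasets row quotes the MNIST task, which trains a four layer convolutional network. The sketch below shows what such a model could look like in JAX/Flax; the layer widths and kernel sizes are assumptions, since the paper's exact architecture is not given in the main text.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class FourLayerCNN(nn.Module):
    """Hypothetical four-layer MNIST classifier (two conv + two dense layers);
    widths and kernel sizes are assumptions, not the paper's architecture."""
    @nn.compact
    def __call__(self, x):                                    # x: (batch, 28, 28, 1)
        x = nn.relu(nn.Conv(features=16, kernel_size=(3, 3))(x))
        x = nn.max_pool(x, window_shape=(2, 2), strides=(2, 2))
        x = nn.relu(nn.Conv(features=32, kernel_size=(3, 3))(x))
        x = nn.max_pool(x, window_shape=(2, 2), strides=(2, 2))
        x = x.reshape((x.shape[0], -1))                       # flatten
        x = nn.relu(nn.Dense(features=64)(x))
        return nn.Dense(features=10)(x)                       # logits for the 10 digits

model = FourLayerCNN()
params = model.init(jax.random.PRNGKey(0), jnp.zeros((1, 28, 28, 1)))
logits = model.apply(params, jnp.zeros((8, 28, 28, 1)))       # shape (8, 10)
```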
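
The Dataset Splits row quotes the baseline tuning procedure: hyperparameters were selected from 2500 samples randomly drawn from a grid. The snippet below is a minimal sketch of that kind of random grid search; the grid values and the objective are hypothetical placeholders, not the paper's (its exact grid ranges are in App. D.3).

```python
import itertools
import random

# Hypothetical grid; the paper's exact grid ranges are in its App. D.3.
grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2],
    "momentum": [0.0, 0.5, 0.9, 0.99],
    "weight_decay": [0.0, 1e-5, 1e-4],
}

def final_loss(hparams):
    # Placeholder objective; in the paper this would be the loss reached by a
    # baseline optimizer run with these hyperparameters on the task.
    return (hparams["learning_rate"] - 1e-3) ** 2 + abs(hparams["momentum"] - 0.9)

points = [dict(zip(grid, combo)) for combo in itertools.product(*grid.values())]
samples = random.choices(points, k=2500)   # 2500 random draws, as quoted above
best = min(samples, key=final_loss)
print(best)
```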