Learning to learn by gradient descent by gradient descent
Authors: Marcin Andrychowicz, Misha Denil, Sergio Gómez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, Nando de Freitas
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art. |
| Researcher Affiliation | Collaboration | Google DeepMind; University of Oxford; Canadian Institute for Advanced Research |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | In this experiment we test whether trainable optimizers can learn to optimize a small neural network on MNIST... Next we test the performance of the trained neural optimizers on optimizing classification performance for the CIFAR-10 dataset [Krizhevsky, 2009]... We train optimizers using only 1 style and 1800 content images taken from ImageNet [Deng et al., 2009]. |
| Dataset Splits | Yes | We randomly select 100 content images for testing and 20 content images for validation of trained optimizers. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions using 'ADAM' and the 'optim package in Torch7', but it does not specify version numbers for these software components. |
| Experiment Setup | Yes | In all experiments the trained optimizers use two-layer LSTMs with 20 hidden units in each layer. Each optimizer is trained by minimizing Equation 3 using truncated BPTT as described in Section 2. The minimization is performed using ADAM with a learning rate chosen by random search... The base network is an MLP with one hidden layer of 20 units using a sigmoid activation function... Each optimization was run for 100 steps and the trained optimizers were unrolled for 20 steps. We used input preprocessing described in Appendix A and rescaled the outputs of the LSTM by the factor 0.1. |
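
Since the paper releases no code (see the Open Source Code row), the following is a minimal sketch of the configuration quoted in the Experiment Setup row: a two-layer LSTM optimizer with 20 hidden units per layer, applied coordinatewise to preprocessed gradients (Appendix A), with its output rescaled by 0.1, meta-trained by minimizing the summed optimizee losses (Equation 3) with Adam over 20-step truncated-BPTT unrolls. PyTorch is assumed in place of the paper's Torch7, and `LSTMOptimizer`, `meta_train_segment`, and `make_optimizee` are hypothetical names used only for illustration.

```python
import torch
import torch.nn as nn


class LSTMOptimizer(nn.Module):
    """Two-layer LSTM (20 hidden units each) mapping a preprocessed gradient
    coordinate to an update step, rescaled by 0.1 as stated in the setup."""

    def __init__(self, hidden=20, out_scale=0.1, p=10.0):
        super().__init__()
        self.cell1 = nn.LSTMCell(2, hidden)   # 2 inputs per coordinate (Appendix A preprocessing)
        self.cell2 = nn.LSTMCell(hidden, hidden)
        self.head = nn.Linear(hidden, 1)
        self.out_scale = out_scale
        self.p = p

    def preprocess(self, g):
        # Appendix A: (log|g|/p, sgn(g)) for large gradients, (-1, e^p * g) otherwise.
        big = g.abs() >= torch.exp(torch.tensor(-self.p))
        first = torch.where(big, g.abs().clamp_min(1e-20).log() / self.p, -torch.ones_like(g))
        second = torch.where(big, g.sign(), g * torch.exp(torch.tensor(self.p)))
        return torch.stack([first, second], dim=-1)   # shape: (n_coords, 2)

    def forward(self, grad, state):
        # Gradients are treated as constants w.r.t. the optimizer parameters (detach),
        # matching the paper's simplifying assumption.
        x = self.preprocess(grad.detach().view(-1))
        h1, c1 = self.cell1(x, (state[0], state[1]))
        h2, c2 = self.cell2(h1, (state[2], state[3]))
        update = self.out_scale * self.head(h2).view_as(grad)
        return update, (h1, c1, h2, c2)


def meta_train_segment(opt_net, meta_adam, make_optimizee, unroll=20, hidden=20):
    """One truncated-BPTT segment: unroll the optimizee for `unroll` steps, sum its
    losses (the meta-objective of Equation 3), and update the LSTM optimizer with Adam.
    `make_optimizee` is a hypothetical helper returning fresh optimizee parameters
    (a flat tensor with requires_grad=True) and a differentiable loss function."""
    theta, loss_fn = make_optimizee()
    # Fresh LSTM state each segment; carrying state across segments is omitted for brevity.
    state = tuple(torch.zeros(theta.numel(), hidden) for _ in range(4))
    meta_loss = 0.0
    for _ in range(unroll):
        loss = loss_fn(theta)
        grad, = torch.autograd.grad(loss, theta, retain_graph=True)
        update, state = opt_net(grad, state)
        theta = theta + update                 # optimizee step proposed by the LSTM
        meta_loss = meta_loss + loss
    meta_adam.zero_grad()
    meta_loss.backward()                       # backprop through the unrolled trajectory
    meta_adam.step()
```

In this sketch `meta_adam` would be `torch.optim.Adam(opt_net.parameters(), lr=...)`, with the learning rate chosen by random search as the quoted setup describes; an optimization run of 100 steps then corresponds to five 20-step segments.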