Learning to learn by gradient descent by gradient descent

Authors: Marcin Andrychowicz, Misha Denil, Sergio Gómez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, Nando de Freitas

NeurIPS 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art."
Researcher Affiliation | Collaboration | "¹Google DeepMind ²University of Oxford ³Canadian Institute for Advanced Research"
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | Yes | "In this experiment we test whether trainable optimizers can learn to optimize a small neural network on MNIST... Next we test the performance of the trained neural optimizers on optimizing classification performance for the CIFAR-10 dataset [Krizhevsky, 2009]... We train optimizers using only 1 style and 1800 content images taken from ImageNet [Deng et al., 2009]."
Dataset Splits | Yes | "We randomly select 100 content images for testing and 20 content images for validation of trained optimizers."
Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions using ADAM and the optim package in Torch7, but it does not specify version numbers for these software components.
Experiment Setup | Yes | "In all experiments the trained optimizers use two-layer LSTMs with 20 hidden units in each layer. Each optimizer is trained by minimizing Equation 3 using truncated BPTT as described in Section 2. The minimization is performed using ADAM with a learning rate chosen by random search... The base network is an MLP with one hidden layer of 20 units using a sigmoid activation function... Each optimization was run for 100 steps and the trained optimizers were unrolled for 20 steps. We used input preprocessing described in Appendix A and rescaled the outputs of the LSTM by the factor 0.1."
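The Experiment Setup quote above pins down most of the learned-optimizer architecture: a two-layer LSTM with 20 hidden units per layer, applied coordinatewise to the optimizee's parameters (per the paper), fed the preprocessed gradient from Appendix A, and with its scalar output rescaled by 0.1 to give the update step. Below is a minimal sketch of that module in PyTorch; the use of PyTorch (the original work used Torch7), the class and function names, and the exact log/sign form of the Appendix A preprocessing are assumptions on my part, not code from the paper.

    import math
    import torch
    import torch.nn as nn

    def preprocess_gradient(g, p=10.0):
        # Log/sign gradient encoding in the spirit of Appendix A (assumed exact form):
        # large gradients -> (log|g| / p, sign(g)); tiny gradients -> (-1, exp(p) * g).
        large = g.abs() >= math.exp(-p)
        first = torch.where(large, g.abs().clamp_min(1e-38).log() / p,
                            torch.full_like(g, -1.0))
        second = torch.where(large, g.sign(), g * math.exp(p))
        return torch.stack([first, second], dim=-1)   # shape: (num_coords, 2)

    class CoordinatewiseLSTMOptimizer(nn.Module):
        # Two-layer LSTM with 20 hidden units per layer, with weights shared across
        # all parameter coordinates; its scalar output is the proposed update step.
        def __init__(self, hidden_size=20, in_features=2):
            super().__init__()
            self.cell1 = nn.LSTMCell(in_features, hidden_size)
            self.cell2 = nn.LSTMCell(hidden_size, hidden_size)
            self.head = nn.Linear(hidden_size, 1)
            self.hidden_size = hidden_size

        def init_state(self, num_coords):
            zeros = lambda: torch.zeros(num_coords, self.hidden_size)
            return (zeros(), zeros()), (zeros(), zeros())

        def forward(self, grads, state):
            # grads: flat tensor of per-coordinate gradients of the optimizee.
            (h1, c1), (h2, c2) = state
            x = preprocess_gradient(grads)            # (num_coords, 2)
            h1, c1 = self.cell1(x, (h1, c1))
            h2, c2 = self.cell2(h1, (h2, c2))
            update = 0.1 * self.head(h2).squeeze(-1)  # rescale LSTM output by 0.1
            return update, ((h1, c1), (h2, c2))

In the setup quoted above, updates from such a module would be applied to the base network (an MLP with one hidden layer of 20 sigmoid units) for 100 optimization steps, with backpropagation through the unrolled optimization truncated every 20 steps and the summed optimizee loss (the paper's Equation 3) minimized by ADAM at a learning rate found by random search.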