Backpropamine: training self-modifying neural networks with differentiable neuromodulated plasticity

Authors: Thomas Miconi, Aditya Rawal, Jeff Clune, Kenneth O. Stanley

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that neuromodulated plasticity improves the performance of neural networks on both reinforcement learning and supervised learning tasks. In one task, neuromodulated plastic LSTMs with millions of parameters outperform standard LSTMs on a benchmark language modeling task (controlling for the number of parameters). Our experimental results establish that neuromodulated plastic networks outperform both non-plastic and non-modulated plastic networks, both on simple reinforcement learning tasks and on a complex language modeling task involving a multi-million parameter network.
Researcher Affiliation | Industry | Thomas Miconi, Aditya Rawal, Jeff Clune & Kenneth O. Stanley, Uber AI Labs, tmiconi|aditya.rawal|jeffclune|kstanley@uber.com
Pseudocode | No | The paper includes mathematical equations (Eq. 1-5) and descriptions of processes, but does not present them in a structured 'Pseudocode' or 'Algorithm' block format (see the sketch after this table).
Open Source Code | No | The paper references a third-party code repository for a baseline model ('All other hyperparameters are taken from Merity & Socher (2017), using the instructions provided on the code repository for their model, available at https://github.com/salesforce/awd-lstm-lm'), but does not state that the code for the methods described in this paper is openly available.
Open Datasets | Yes | The Penn Tree Bank corpus (PTB), a well known benchmark for language modeling (Marcus et al., 1993), is used here for comparing different models.
Dataset Splits | Yes | The dataset consists of 929k training words, 73k validation words, and 82k test words, with a vocabulary of 10k words.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory specifications) used for running the experiments.
Software Dependencies | No | The paper mentions general software components and algorithms (e.g., LSTMs, SGD) but does not provide specific version numbers for these software dependencies (e.g., 'PyTorch 1.x', 'CUDA 11.x').
Experiment Setup | Yes | The initial learning rate was set to 1.0. Each model is trained for 13 epochs. The hidden states of the LSTM are initialized to zero; the final hidden states of the current minibatch are used as the initial hidden states of the subsequent minibatch. A grid search was performed over four hyperparameters: (1) learning rate decay factor, in the range 0.25 to 0.4 in steps of 0.01; (2) epoch at which learning rate decay begins, in {4, 5, 6}; (3) initial scale of weights, in {0.09, 0.1, 0.11, 0.12}; (4) L2 penalty constant, in {1e-2, 1e-3, 1e-4, 1e-5, 1e-6}. The norm of the gradient is clipped at 5. ... The main departures from Merity & Socher (2017) are that we do not implement recurrent dropout (feedforward dropout is preserved) and reduce batch size to 7 due to computational limitations. Other hyperparameters are taken as is without any tuning. (A grid-search sketch follows this table.)
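
The Pseudocode row above notes that the method is specified through Eqs. 1-5 rather than an algorithm block. As an illustration only, below is a minimal PyTorch-style sketch of the "simple neuromodulation" variant those equations describe: a plastic recurrent layer whose Hebbian traces are gated by a network-computed scalar M(t). The class and attribute names are hypothetical and not taken from the authors' code, and details such as the exact clipping function and initialization are assumptions that should be checked against the paper.

```python
# Minimal sketch (hypothetical names and initialization), not the authors' implementation.
import torch
import torch.nn as nn


class NeuromodulatedPlasticRNN(nn.Module):
    """Plastic recurrent layer with simple neuromodulation:
    h_j(t) = tanh( W_in x(t) + sum_i (w_ij + alpha_ij * Hebb_ij) h_i(t-1) )
    Hebb  <- clip(Hebb + M(t) * h(t-1) h(t)^T), with M(t) computed by the network itself.
    """

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.hidden_size = hidden_size
        self.w_in = nn.Linear(input_size, hidden_size)                            # fixed input weights
        self.w = nn.Parameter(0.01 * torch.randn(hidden_size, hidden_size))       # fixed recurrent weights
        self.alpha = nn.Parameter(0.01 * torch.randn(hidden_size, hidden_size))   # per-connection plasticity coefficients
        self.mod_readout = nn.Linear(hidden_size, 1)                              # produces the scalar neuromodulator M(t)

    def initial_state(self, batch_size: int):
        h = torch.zeros(batch_size, self.hidden_size)
        hebb = torch.zeros(batch_size, self.hidden_size, self.hidden_size)
        return h, hebb

    def forward(self, x, h_prev, hebb):
        # Effective recurrent weights = fixed part + plastic part (per batch element).
        w_eff = self.w + self.alpha * hebb                                        # (B, H, H)
        h = torch.tanh(self.w_in(x) + torch.bmm(h_prev.unsqueeze(1), w_eff).squeeze(1))
        # Network-computed neuromodulatory signal M(t), one scalar per sequence.
        m = torch.tanh(self.mod_readout(h)).unsqueeze(2)                          # (B, 1, 1)
        # Hebbian trace update: outer product of pre- and post-synaptic activity,
        # gated by M(t) and kept bounded (hard clip used here as an assumption).
        outer = torch.bmm(h_prev.unsqueeze(2), h.unsqueeze(1))                    # (B, H, H)
        hebb = torch.clamp(hebb + m * outer, -1.0, 1.0)
        return h, hebb
```

The paper also describes a "retroactive" variant that routes M(t) through an eligibility trace, and applies neuromodulated plasticity inside LSTMs for the language modeling experiment; this sketch covers only the simple variant for a vanilla recurrent layer.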
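To make the hyperparameter grid in the Experiment Setup row concrete, here is a minimal Python sketch of the search loop under the stated fixed settings (initial learning rate 1.0, 13 epochs, gradient norm clipped at 5, batch size 7). The train_and_eval function is a hypothetical placeholder for training the model on PTB and returning validation perplexity; it is not code from the paper.

```python
# Sketch of the grid search described above; train_and_eval is a hypothetical placeholder.
import itertools

lr_decay_factors = [round(0.25 + 0.01 * i, 2) for i in range(16)]  # 0.25 .. 0.40 in steps of 0.01
decay_start_epochs = [4, 5, 6]
init_scales = [0.09, 0.10, 0.11, 0.12]
l2_penalties = [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]

FIXED = dict(initial_lr=1.0, epochs=13, grad_clip=5.0, batch_size=7)


def train_and_eval(config):
    """Placeholder: train the (plastic) LSTM on PTB with `config` and
    return validation perplexity."""
    raise NotImplementedError


best = None
for decay, start_epoch, scale, l2 in itertools.product(
        lr_decay_factors, decay_start_epochs, init_scales, l2_penalties):
    config = dict(FIXED, lr_decay=decay, decay_start_epoch=start_epoch,
                  init_scale=scale, l2_penalty=l2)
    val_ppl = train_and_eval(config)
    if best is None or val_ppl < best[0]:
        best = (val_ppl, config)

# `best` then holds the lowest validation perplexity and the hyperparameters that produced it.
```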