Short-Term Plasticity Neurons Learning to Learn and Forget

Authors: Hector Garcia Rodriguez, Qinghai Guo, Timoleon Moraitis

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here we present a new type of recurrent neural unit, the STP Neuron (STPN), which indeed turns out strikingly powerful. Its key mechanism is that synapses have a state, propagated through time by a self-recurrent connection-within-the-synapse. This formulation enables training the plasticity with backpropagation through time, resulting in a form of learning to learn and forget in the short term. The STPN outperforms all tested alternatives, i.e. RNNs, LSTMs, other models with fast weights, and differentiable plasticity. We confirm this in both supervised and reinforcement learning (RL), and in tasks such as Associative Retrieval, Maze Exploration, Atari video games, and MuJoCo robotics. (A sketch of the synaptic update follows the table.)
Researcher Affiliation | Collaboration | 1 Huawei Technologies Zurich Research Center, Switzerland; 2 University College London, United Kingdom; 3 Advanced Computing & Storage Lab, Huawei Technologies, Shenzhen, China.
Pseudocode | Yes | Algorithm 1: STPN learning to learn and forget in a supervised meta-learning setting. (A training-loop sketch in this spirit follows the table.)
Open Source Code | Yes | Code is available at https://github.com/NeuromorphicComputing/stpn.
Open Datasets | Yes | Associative Retrieval Task (ART) (Ba et al., 2016); Maze Exploration: maze or grid-like tasks have been commonly used in RL... (Miconi et al., 2018); Atari games and MuJoCo simulated robotics: ...Atari Pong and MuJoCo Inverted Pendulum. Pong is an Atari 2600 game implemented in the Arcade Learning Environment (ALE) (Bellemare et al., 2013); MuJoCo (Todorov et al., 2012) is a physics engine widely used for research in robotics and reinforcement learning. Inverted Pendulum is one of the simplest tasks within MuJoCo. (An illustrative ART episode generator follows the table.)
Dataset Splits | Yes | Fig. 2 shows that the STPN is more proficient, i.e. obtains larger validation accuracy and reward, than all other baselines; For the dataset mode, this means accuracy on a validation set.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It mentions 'hypothetical analog neuromorphic hardware' in the context of measuring the model's energy consumption, not the actual experimental setup.
Software Dependencies | No | The paper mentions using 'RLLib (Liang et al., 2018)', 'A2C... A3C (Mnih et al., 2016)', 'Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017)', and 'MuJoCo (Todorov et al., 2012)', but does not specify version numbers for these software components.
Experiment Setup | Yes | We tune rollout length (50), gradient clipping (40), discount factor (0.99) in shorter runs (which both models share in the displayed results); and additionally tune initial learning rate for the final longer runs (0.0007 and 0.0001 respectively), using a linear decay learning rate schedule finishing at 10^-11 at 200 million iterations. Models are trained from the experience collected by 64 parallel agents; We only increase the batch size (number of agents acting in parallel) from 16 (in the code, not mentioned in the article) to 512 to maximize computational efficiency of gradient updates. (A learning-rate schedule sketch follows the table.)
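
To make the mechanism quoted in the Research Type row concrete, here is a minimal sketch of a recurrent unit whose synapses carry a short-term state F that decays and grows Hebbian-style, with the per-synapse forgetting and plasticity rates trained by backpropagation through time. The class name, the tanh nonlinearity, the initializations and the exact update rule are illustrative assumptions rather than the authors' implementation, and the neuron-level recurrent connections of the full STPN are omitted for brevity; the paper's formulation is authoritative.

import torch
import torch.nn as nn

class STPNCellSketch(nn.Module):
    """Unit with a per-synapse short-term state F (illustrative sketch only)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Long-term weights, trained as usual.
        self.W = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        # Per-synapse forgetting (lam) and plasticity (gamma) rates, also trained
        # by backpropagation through time: "learning to learn and forget".
        self.lam = nn.Parameter(torch.full((out_features, in_features), 0.9))
        self.gamma = nn.Parameter(torch.full((out_features, in_features), 0.01))

    def forward(self, x: torch.Tensor, F: torch.Tensor):
        # x: (batch, in_features); F: (batch, out_features, in_features).
        W_eff = self.W.unsqueeze(0) + F                       # short-term modulated weights
        h = torch.tanh(torch.einsum('boi,bi->bo', W_eff, x))
        # Self-recurrent synaptic state: decay plus a Hebbian-style outer product
        # of post-synaptic (h) and pre-synaptic (x) activity.
        F_next = self.lam * F + self.gamma * torch.einsum('bo,bi->boi', h, x)
        return h, F_next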
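The Associative Retrieval Task listed in the Open Datasets row presents key-value pairs and then queries one of the keys. Below is a hedged data-generation sketch in the style of Ba et al. (2016); the exact token alphabet, episode length and encoding used in the paper's experiments may differ.

import random
import string

def art_episode(num_pairs: int = 8):
    """One associative-retrieval episode: key-value pairs, then a query key
    whose paired value is the target (encoding is illustrative)."""
    keys = random.sample(string.ascii_lowercase, num_pairs)
    values = [random.choice(string.digits) for _ in range(num_pairs)]
    sequence = [tok for pair in zip(keys, values) for tok in pair]
    query = random.choice(keys)
    target = values[keys.index(query)]
    return sequence + ['?', '?', query], target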
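Algorithm 1 (Pseudocode row) trains the plasticity by unrolling whole episodes and backpropagating through time. The following is a minimal supervised loop in that spirit, reusing STPNCellSketch from the sketch above; resetting F at episode start, the linear readout head and the optimizer wiring are assumptions for illustration, not the authors' code.

import torch
import torch.nn as nn

def train_episode(cell, readout: nn.Linear, optimizer: torch.optim.Optimizer,
                  xs: torch.Tensor, target: torch.Tensor) -> float:
    """xs: (T, batch, in_features) episode inputs; target: (batch,) class labels."""
    batch = xs.shape[1]
    out_features, in_features = cell.W.shape
    # Fresh synaptic state at every episode: the unit "learns" within an
    # episode and "forgets" across episodes.
    F = torch.zeros(batch, out_features, in_features)
    h = xs.new_zeros(batch, out_features)
    for x in xs:                        # unroll over the sequence
        h, F = cell(x, F)
    loss = nn.functional.cross_entropy(readout(h), target)
    optimizer.zero_grad()
    loss.backward()                     # BPTT updates W, lam and gamma jointly
    optimizer.step()
    return loss.item()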
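The Experiment Setup row reports a linear decay from the tuned initial learning rate down to 10^-11 over 200 million iterations. A sketch of such a schedule is below; the helper name and the LambdaLR wiring are assumptions, and the authors' training code may implement the schedule differently.

import torch

def linear_decay_lr(step: int, total_steps: int = 200_000_000,
                    lr_init: float = 7e-4, lr_final: float = 1e-11) -> float:
    """Linearly interpolate from lr_init to lr_final over total_steps."""
    frac = min(step / total_steps, 1.0)
    return lr_init + frac * (lr_final - lr_init)

# Example wiring with a torch optimizer; LambdaLR expects a multiplicative
# factor relative to the optimizer's initial learning rate.
model = torch.nn.Linear(4, 2)           # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=7e-4)
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lr_lambda=lambda step: linear_decay_lr(step) / 7e-4)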