Short-Term Plasticity Neurons Learning to Learn and Forget
Authors: Hector Garcia Rodriguez, Qinghai Guo, Timoleon Moraitis
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here we present a new type of recurrent neural unit, the STP Neuron (STPN), which indeed turns out strikingly powerful. Its key mechanism is that synapses have a state, propagated through time by a self-recurrent connection-within-the-synapse. This formulation enables training the plasticity with backpropagation through time, resulting in a form of learning to learn and forget in the short term. The STPN outperforms all tested alternatives, i.e. RNNs, LSTMs, other models with fast weights, and differentiable plasticity. We confirm this in both supervised and reinforcement learning (RL), and in tasks such as Associative Retrieval, Maze Exploration, Atari video games, and MuJoCo robotics. (A minimal sketch of this synaptic-state mechanism is given after the table.) |
| Researcher Affiliation | Collaboration | (1) Huawei Technologies Zurich Research Center, Switzerland; (2) University College London, United Kingdom; (3) Advanced Computing & Storage Lab, Huawei Technologies, Shenzhen, China. |
| Pseudocode | Yes | Algorithm 1 STPN learning to learn and forget in a supervised meta-learning setting |
| Open Source Code | Yes | Code is available at https://github.com/NeuromorphicComputing/stpn. |
| Open Datasets | Yes | Associative Retrieval Task (ART) (Ba et al., 2016); Maze Exploration: Maze or grid-like tasks have been commonly used in RL... (Miconi et al., 2018); Atari games and MuJoCo simulated robotics: ...Atari Pong and MuJoCo Inverted Pendulum. Pong is an Atari 2600 game implemented in the Arcade Learning Environment (ALE) (Bellemare et al., 2013); MuJoCo (Todorov et al., 2012) is a physics engine widely used for research in robotics and reinforcement learning. Inverted Pendulum is one of the simplest tasks within MuJoCo. |
| Dataset Splits | Yes | Fig. 2 shows that the STPN is more proficient, i.e. obtains larger validation accuracy and reward, than all other baselines.; For the dataset mode, this means accuracy on a validation set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It mentions 'hypothetical analog neuromorphic hardware' in the context of energy consumption measurement of the model, not the actual experimental setup. |
| Software Dependencies | No | The paper mentions using 'RLLib (Liang et al., 2018)', 'A2C... A3C (Mnih et al., 2016)', and 'Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017)', 'MuJoCo (Todorov et al., 2012)', but does not specify version numbers for these software components. |
| Experiment Setup | Yes | We tune rollout length (50), gradient clipping (40), discount factor (0.99) in shorter runs (which both models share in the displayed results); and additionally tune initial learning rate for the final longer runs (0.0007 and 0.0001 respectively), using a linear decay learning rate schedule finishing at 10^-11 at 200 million iterations. Models are trained from the experience collected by 64 parallel agents.; We only increase the batch size (number of agents acting in parallel) from 16 (in the code, not mentioned in the article) to 512 to maximize computational efficiency of gradient updates. (A hypothetical helper for this schedule is sketched after the table.) |
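The abstract quoted in the Research Type row describes the STPN's key mechanism as a synaptic state carried through time by a self-recurrent connection within each synapse, with the plasticity itself trained by backpropagation through time. The PyTorch sketch below illustrates one plausible reading of that description: the effective weight is a long-term weight plus a per-synapse short-term trace with a learned decay and a learned Hebbian gain. The class name `STPNSketch`, the exact parameterization, and the sigmoid squashing of the decay are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn


class STPNSketch(nn.Module):
    """Minimal sketch of a short-term-plasticity recurrent unit (assumptions noted above)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Long-term weight, learned as usual.
        self.W = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        # Per-synapse plasticity parameters, learned with BPTT ("learning to learn and forget").
        self.lam = nn.Parameter(torch.full((out_features, in_features), 2.0))     # forget rate (pre-sigmoid)
        self.gamma = nn.Parameter(torch.full((out_features, in_features), 0.01))  # Hebbian gain

    def forward(self, x_seq: torch.Tensor) -> torch.Tensor:
        # x_seq: (seq_len, batch, in_features)
        batch = x_seq.shape[1]
        # Short-term synaptic state, one trace per synapse and per sample.
        F = x_seq.new_zeros(batch, *self.W.shape)
        outputs = []
        for x in x_seq:
            G = self.W + F                                     # effective (long- + short-term) weights
            h = torch.tanh(torch.einsum('boi,bi->bo', G, x))   # postsynaptic activity
            # Self-recurrent update of the synaptic state: learned decay plus Hebbian term.
            F = torch.sigmoid(self.lam) * F + self.gamma * torch.einsum('bo,bi->boi', h, x)
            outputs.append(h)
        return torch.stack(outputs)                            # (seq_len, batch, out_features)
```

Because `F` is carried across timesteps inside the forward pass, gradients with respect to `lam` and `gamma` flow through the whole sequence under standard BPTT, which is the property the abstract highlights.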
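The Experiment Setup row quotes a linear learning-rate decay from the tuned initial rate down to 10^-11 at 200 million iterations. The helper below is a hypothetical illustration of that schedule (the function name and defaults are ours, not taken from the repository); the 0.0007 initial rate is the value quoted for one of the two models.

```python
def linear_lr(step: int,
              total_steps: int = 200_000_000,
              lr_init: float = 0.0007,
              lr_final: float = 1e-11) -> float:
    """Linearly anneal the learning rate from lr_init to lr_final over total_steps."""
    frac = min(step, total_steps) / total_steps
    return lr_init + frac * (lr_final - lr_init)


# Example: the rate at the halfway point of training.
print(linear_lr(100_000_000))  # ~0.00035
```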