Continual Reinforcement Learning with Complex Synapses

Authors: Christos Kaplanis, Murray Shanahan, Claudia Clopath

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we show that by equipping tabular and deep reinforcement learning agents with a synaptic model that incorporates this biological complexity (Benna & Fusi, 2016), catastrophic forgetting can be mitigated at multiple timescales. In particular, we find that as well as enabling continual learning across sequential training of two simple tasks, it can also be used to overcome within-task forgetting by reducing the need for an experience replay database.
Researcher Affiliation | Collaboration | (1) Department of Computing, Imperial College London; (2) Department of Bioengineering, Imperial College London; (3) Google DeepMind, London.
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the methodology is openly available.
Open Datasets | Yes | The version used was CartPole-v1 from the OpenAI Gym (Brockman et al., 2016).
Dataset Splits | No | The paper does not explicitly mention validation dataset splits. It describes criteria for deeming a task (re)learnt during training, but not specific data partitioning for validation.
Hardware Specification | No | The paper does not specify the exact hardware (e.g., CPU, GPU models) used for the experiments.
Software Dependencies | No | The paper mentions algorithms and methods (e.g., Euler method, Adam optimizer, soft Q-learning objective) and cites related work, but does not provide specific software dependencies with version numbers (e.g., Python, TensorFlow, PyTorch versions).
Experiment Setup | Yes | In our experiments, we simulated the Benna-Fusi ODEs using the Euler method for numerical integration... In a Benna-Fusi chain of length N, C_1/g_{1,2} and C_N/g_{N,N+1} determine the shortest and longest memory timescales... we set g_{1,2} to 10^-5 to correspond roughly to the minimum number of Q-learning updates per epoch, and the number of variables in each chain to 3, all of which were initialised to 0. The ODEs were numerically integrated after every Q-learning update with a time step of Δt = 1. A table of all parameters used for simulation is shown in Supp. Table 1. The control agent was essentially a DQN (Mnih et al., 2015) with two fully connected hidden layers of 400 and 200 ReLUs respectively... The network was trained with the soft Q-learning objective (Haarnoja et al., 2017)... The experience replay database had a size of 2000, from which 64 experiences were sampled for training with Adam (Kingma & Ba, 2014) at the end of every episode. Crucially, the database was cleared at the end of every epoch... The agent was ϵ-greedy with respect to the stochastic soft Q-learning policy and ϵ was decayed from 1 to almost 0 over the course of each epoch. Finally, soft target network updates were used as in (Lillicrap et al., 2015)... A full table of parameters used can be seen in Supp. Table 2. The Benna-Fusi agent was identical to the control agent, except that each network parameter was modelled as a Benna-Fusi synapse with 30 variables with g_{1,2} set to 0.001625... For this reason, the effective flow between u_1 and u_2 was 64 × 0.001625 ≈ 0.1...
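The Benna-Fusi synapse described in the Experiment Setup row can be made concrete with a short numerical sketch. The snippet below is a minimal NumPy illustration of Euler-integrating a chain of coupled "beaker" variables; the class name BennaFusiChain, its interface, the geometric capacity/conductance scaling (C_k = 2^(k-1), g_{k,k+1} = g_{1,2}/2^(k-1), as in Benna & Fusi, 2016) and the reflecting boundary at the last variable are assumptions for illustration, not details quoted from the paper.

```python
import numpy as np

class BennaFusiChain:
    """Illustrative Euler-integration sketch of a Benna-Fusi synaptic chain.

    Each parameter is backed by a chain of coupled variables u_1..u_N ("beakers").
    Only u_1 is visible to the learning algorithm; deeper variables evolve on
    exponentially longer timescales and slowly pull u_1 back towards old values.
    Assumed here: capacities double along the chain (C_k = 2^(k-1)), conductances
    halve (g_{k,k+1} = g_{1,2} / 2^(k-1)), and the last beaker has no leak.
    """

    def __init__(self, shape=(), n_vars=3, g12=1e-5, dt=1.0):
        shape = tuple(shape)
        ones = (1,) * len(shape)                        # for broadcasting over parameters
        self.u = np.zeros((n_vars,) + shape)            # all chain variables initialised to 0
        self.C = (2.0 ** np.arange(n_vars)).reshape((n_vars,) + ones)
        self.g = (g12 / 2.0 ** np.arange(n_vars - 1)).reshape((n_vars - 1,) + ones)
        self.dt = dt

    @property
    def weight(self):
        return self.u[0]                                # the value the agent actually uses

    def step(self, delta):
        """Apply a learning update to u_1, then Euler-step the chain ODEs once."""
        self.u[0] += delta                              # e.g. a (soft) Q-learning update
        flow = self.g * (self.u[1:] - self.u[:-1])      # tube k carries g_{k,k+1}(u_{k+1} - u_k)
        du = np.zeros_like(self.u)
        du[:-1] += flow                                 # deeper beaker pulls on shallower one
        du[1:] -= flow                                  # equal and opposite flow downstream
        self.u += self.dt * du / self.C                 # C_k du_k/dt = net flow into beaker k
```

Instantiated with n_vars=3, g12=1e-5 and dt=1.0 and stepped after every Q-learning update, this mirrors the tabular setup quoted above; the deep agent analogue would use 30 variables per weight with g12=0.001625, which gives the quoted effective flow of roughly 0.1 per 64-sample minibatch.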
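The action-selection and target-update details in the same row ("ϵ-greedy with respect to the stochastic soft Q-learning policy", "soft target network updates ... as in (Lillicrap et al., 2015)") can also be sketched. This is one plausible reading, not the authors' code: the temperature, the tau value, and the dict-of-arrays parameter representation are placeholders.

```python
import numpy as np

def soft_q_policy(q_values, temperature=1.0):
    # Boltzmann policy implied by soft Q-learning: pi(a|s) proportional to exp(Q(s,a)/T).
    logits = (np.asarray(q_values) - np.max(q_values)) / temperature  # shift max for stable exp
    probs = np.exp(logits)
    return probs / probs.sum()

def select_action(q_values, epsilon, temperature=1.0, rng=None):
    # epsilon-greedy wrapper around the stochastic soft-Q policy: with probability
    # epsilon act uniformly at random, otherwise sample from the Boltzmann policy.
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(rng.choice(len(q_values), p=soft_q_policy(q_values, temperature)))

def soft_target_update(target, online, tau=0.001):
    # Polyak-averaged ("soft") target network update (Lillicrap et al., 2015):
    # theta_target <- tau * theta_online + (1 - tau) * theta_target.
    # target and online are assumed to be dicts mapping parameter names to arrays.
    for name in target:
        target[name] = tau * online[name] + (1.0 - tau) * target[name]
    return target
```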