Policy Consolidation for Continual Reinforcement Learning

Authors: Christos Kaplanis, Murray Shanahan, Claudia Clopath

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the PC agent's capability for continual learning by training it on a number of continuous control tasks in three non-stationary settings that differ in how the data distribution changes over time: (i) alternating between two tasks during training, (ii) training on just one task, and (iii) in a multi-agent self-play environment. We find that on average the PC model improves continual learning relative to baselines in all three scenarios. (A schematic alternating-task loop is sketched after the table.)
Researcher Affiliation | Collaboration | 1 Department of Computing, Imperial College London; 2 Department of Bioengineering, Imperial College London; 3 DeepMind, London.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not explicitly state that the code for the described methodology is open-source or provide a link to it. The only link to code is for a third-party baseline (OpenAI Baselines).
Open Datasets | Yes | We evaluated the performance of a PC agent over a number of continuous control tasks (Brockman et al., 2016; Henderson et al., 2017; Al-Shedivat et al., 2018) in three separate RL settings. For the multi-agent experiments, agents were trained via self-play in the RoboSumo-Ant-vs-Ant-v0 environment developed in (Al-Shedivat et al., 2018).
Dataset Splits | No | The paper describes training duration and scenario (e.g., '20 million steps of training', 'alternating tasks'), and the continuous control tasks generate data dynamically. It does not provide explicit train/validation/test splits (percentages or sample counts) in the usual supervised-learning sense, nor does it reference predefined splits with citations.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions 'Mujoco tasks' and refers to the PPO algorithm, but does not provide version numbers for any software dependencies such as libraries, frameworks, or operating systems.
Experiment Setup | Yes | All agents shared the same architecture for the policy network, namely a multilayer perceptron with two hidden layers of 64 ReLUs. The PC agent used for all experiments (unless otherwise stated) consisted of 7 hidden policy networks with β = 0.5 and ω = 4.0. One important change was that the batch sizes per update were much larger, as the trajectories were made longer and also generated by multiple distributed actors (as in Schulman et al., 2017). A larger batch size reduces the variance of the policy gradient, which allowed us to permit larger updates in policy space by decreasing β and ω_{1,2} (from 0.5 and 4 to 0.1 and 0.25, respectively) in the PC model and thus speed up training. (A schematic of this architecture is sketched after the table.)
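
The "alternating tasks" setting described in the Research Type row can be made concrete with a short training-loop skeleton. This is a minimal sketch assuming the classic OpenAI Gym API; the task names, the switch period, and the agent's act/observe interface are illustrative assumptions, not details taken from the paper (only the 20-million-step total is quoted from it).

```python
# Minimal sketch of an alternating-task schedule for continual RL.
# Task names and switch period are illustrative placeholders.
import gym

TASKS = ["HalfCheetah-v2", "Walker2d-v2"]   # two continuous control tasks (illustrative)
SWITCH_EVERY = 1_000_000                    # env steps between task switches (illustrative)
TOTAL_STEPS = 20_000_000                    # matches the "20 million steps" quoted above


def task_for_step(step: int) -> str:
    """Return the active task name for a given global environment step."""
    return TASKS[(step // SWITCH_EVERY) % len(TASKS)]


def run_schedule(agent, total_steps: int = TOTAL_STEPS):
    """Train `agent` (hypothetical act/observe interface) on the alternating schedule."""
    env, active = None, None
    step = 0
    obs = None
    while step < total_steps:
        name = task_for_step(step)
        if name != active:                  # task boundary: swap environments
            if env is not None:
                env.close()
            env, active = gym.make(name), name
            obs = env.reset()
        action = agent.act(obs)             # assumed agent API, not from the paper
        obs, reward, done, _ = env.step(action)
        agent.observe(obs, reward, done)
        if done:
            obs = env.reset()
        step += 1
```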
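
The Experiment Setup row pins down the policy network size (two hidden layers of 64 ReLUs) and the PC hyperparameters (7 hidden policies, β = 0.5, ω = 4.0). The sketch below lays that cascade out in PyTorch under stated assumptions: the class and function names are hypothetical, the Gaussian-policy head is an assumption, and the KL-based consolidation losses that couple adjacent policies in the cascade, which are the core of the PC method, are deliberately omitted.

```python
# Sketch of the policy cascade described in the Experiment Setup row:
# one visible policy plus 7 hidden policies, each a 2x64 ReLU MLP.
# Consolidation losses between adjacent policies are not implemented here.
import torch
import torch.nn as nn


def make_policy_mlp(obs_dim: int, act_dim: int) -> nn.Sequential:
    """Two hidden layers of 64 ReLUs, as stated in the experiment setup."""
    return nn.Sequential(
        nn.Linear(obs_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, act_dim),   # mean of a Gaussian policy (assumption)
    )


class PolicyCascade(nn.Module):
    """Visible policy followed by `n_hidden` consolidated copies (hypothetical layout)."""

    def __init__(self, obs_dim: int, act_dim: int,
                 n_hidden: int = 7, beta: float = 0.5, omega: float = 4.0):
        super().__init__()
        self.beta, self.omega = beta, omega   # values quoted from the paper
        self.policies = nn.ModuleList(
            [make_policy_mlp(obs_dim, act_dim) for _ in range(1 + n_hidden)]
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Only the visible policy (index 0) acts in the environment;
        # the hidden policies would be trained via consolidation terms (omitted).
        return self.policies[0](obs)
```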