Reset-Free Lifelong Learning with Skill-Space Planning

Authors: Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate empirically that LiSP successfully enables long-horizon planning and learns agents that can avoid catastrophic failures even in challenging non-stationary and non-episodic environments derived from gridworld and MuJoCo benchmarks. ... 4 EXPERIMENTAL EVALUATIONS We wish to investigate the following questions with regards to the design and performance of LiSP: ... We evaluate LiSP on Hopper and Ant tasks from Gym (Brockman et al., 2016); we call these Lifelong Hopper and Lifelong Ant. ... We summarize the LiSP subroutines in Figure 2. Skills are first learned via Algorithm 2, wherein the skill discriminator generates the intrinsic rewards and the skill-practice distribution generates a skill curriculum. We then plan using the skill policy and the dynamics model as per Algorithm 3. [See the LiSP loop sketch after this table.]
Researcher Affiliation | Collaboration | Kevin Lu, UC Berkeley (kzl@berkeley.edu); Pieter Abbeel, UC Berkeley (pabbeel@cs.berkeley.edu); Aditya Grover, UC Berkeley (adityag@cs.stanford.edu); Igor Mordatch, Google Brain (imordatch@google.com)
Pseudocode | Yes | Algorithm 1: Lifelong Skill Planning (LiSP) ... Algorithm 2: Learning Latent Skills ... Algorithm 3: Skill-Space Planning
Open Source Code | Yes | Project website and materials: https://sites.google.com/berkeley.edu/reset-free-lifelong-learning ... Our code can be found at: https://github.com/kzl/lifelong_rl.
Open Datasets | Yes | The dataset we used for both tasks was the replay buffer generated from a SAC agent trained to convergence, which we set as one million timesteps per environment. This is the standard dataset size used in offline RL (Fu et al., 2020), although of higher quality. [See the dataset-collection sketch after this table.]
Dataset Splits | No | The paper describes generating datasets from a SAC agent and using existing benchmarks, but does not specify explicit train/validation/test splits (e.g., percentages or exact counts) for the data used in experiments.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as CPU or GPU models, or cloud computing specifications.
Software Dependencies | No | The paper mentions various algorithms and environments (e.g., SAC, DADS, OpenAI Gym, MuJoCo) and implies the use of common ML frameworks in Section F on hyperparameters. However, it does not provide specific version numbers for any software dependencies (e.g., PyTorch version, Python version, specific library versions).
Experiment Setup | Yes | For all algorithms (as applicable) and environments, we use the following hyperparameters, roughly based on common values for the parameters in other works (note we classify the skill-practice distribution as a policy/critic here):
- Discount factor γ equal to 0.99
- Replay buffer D size of 10^6
- Dynamics model with three hidden layers of size 256 using tanh activations
- Dynamics model ensemble size of 4 and learning rate 10^-3, training every 250 timesteps
- Policy and critics with two hidden layers of size 256 using ReLU activations
- Discriminator with two hidden layers of size 512 using ReLU activations
- Policy, critic, and discriminator learning rates of 3 × 10^-4, training every timestep
- Automatic entropy tuning for SAC
- Batch size of 256 for gradient updates
- MPC population size S set to 400
- MPC planning iterations P set to 10
- MPC number of particles for expectation calculation set to 20
- MPC temperature of 0.01
- MPC noise standard deviation of 1
- For PETS, we use a planning horizon of either 25 or 180, as mentioned
- For DADS, we find it helpful to multiply the intrinsic reward by a factor of 5 for learning (for both DADS and LiSP usage)
For LiSP specifically, our hyperparameters are:
- Planning horizon of 180
- Repeat a skill for three consecutive timesteps for planning (not for environment interaction)
- Replan at every timestep
- Number of rollouts per iteration M set to 400
- Generated replay buffer D̂ size set to 5000
- Number of prior samples set to 16 in the denominator of the intrinsic reward
- Number of discriminator updates per iteration set to 4
- Number of policy updates per iteration set to 8
- Number of skill-practice updates per iteration set to 4
- Disagreement threshold α_thres set to 0.05 for Hopper, 0.1 for Ant
- Disagreement penalty κ set to 30
[See the planner sketch after this table.]
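
The Research Type and Pseudocode rows above describe LiSP's two subroutines: skills are learned in imagination via a skill discriminator and a skill-practice distribution (Algorithm 2), and actions are chosen by planning over skills with a learned dynamics model (Algorithm 3). The Python sketch below shows one plausible way these pieces could fit together; every name (lisp_step, dynamics_ensemble, skill_policy, discriminator, skill_practice, planner) is a placeholder of ours rather than the authors' API, and the authors' repository at https://github.com/kzl/lifelong_rl is the definitive reference.

```python
# Hedged sketch of how LiSP's subroutines could be composed (cf. Algorithms 1-3).
# All components here are hypothetical placeholders, not the authors' API.

def lisp_step(env, obs, replay_buffer, dynamics_ensemble, skill_policy,
              discriminator, skill_practice, planner,
              num_rollouts=400, generated_buffer_size=5000):
    # 1) Fit the dynamics model ensemble on real experience.
    dynamics_ensemble.train(replay_buffer)

    # 2) Learn skills in imagination (Algorithm 2): the skill-practice
    #    distribution proposes a skill curriculum, rollouts come from the
    #    model, and the discriminator supplies the intrinsic reward.
    imagined = []
    for _ in range(num_rollouts):
        z = skill_practice.sample(obs)
        imagined.extend(dynamics_ensemble.rollout(obs, skill_policy, z))
    imagined = imagined[-generated_buffer_size:]          # generated replay buffer D-hat
    intrinsic = discriminator.intrinsic_reward(imagined)  # DADS-style, normalized by prior skill samples
    skill_policy.update(imagined, intrinsic)
    discriminator.update(imagined)
    skill_practice.update(imagined)

    # 3) Plan in skill space (Algorithm 3) and act in the real environment
    #    (no reset: the next observation simply carries over).
    z_star = planner.plan(obs, skill_policy, dynamics_ensemble)
    action = skill_policy.act(obs, z_star)
    next_obs, reward, done, info = env.step(action)
    replay_buffer.add(obs, action, reward, next_obs, done)
    return next_obs
```

The rollout count (400) and generated-buffer size (5000) mirror the values quoted in the Experiment Setup row; the update frequencies listed there would sit around this step in an outer training loop.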
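
The Open Datasets row states that the offline dataset for each task is the replay buffer of a SAC agent trained to convergence over one million timesteps. A minimal, library-agnostic sketch of that collection procedure is below; the sac_agent object and its act/update methods are hypothetical stand-ins, not a specific framework's API.

```python
import numpy as np

# Hypothetical sketch: build a ~1M-transition offline dataset from the replay
# buffer of a SAC agent that keeps training as it collects.
def collect_offline_dataset(env, sac_agent, total_timesteps=1_000_000):
    data = {"observations": [], "actions": [], "rewards": [],
            "next_observations": [], "terminals": []}
    obs = env.reset()
    for _ in range(total_timesteps):
        action = sac_agent.act(obs)
        next_obs, reward, done, _ = env.step(action)
        data["observations"].append(obs)
        data["actions"].append(action)
        data["rewards"].append(reward)
        data["next_observations"].append(next_obs)
        data["terminals"].append(done)
        sac_agent.update(obs, action, reward, next_obs, done)  # train toward convergence
        obs = env.reset() if done else next_obs
    return {k: np.asarray(v) for k, v in data.items()}
```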
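
The MPC hyperparameters in the Experiment Setup row (population size S = 400, P = 10 iterations, 20 particles, temperature 0.01, noise standard deviation 1, a 180-step horizon with each skill repeated for three model steps, and a disagreement threshold/penalty) are easiest to interpret inside a planner loop. The sketch below assumes an MPPI-style softmax-weighted update in skill space; it is our illustration only, not the authors' Algorithm 3, and skill_policy / model_step are hypothetical placeholders.

```python
import numpy as np

# Assumed MPPI-style skill-space planner illustrating the MPC hyperparameters
# listed above. `skill_policy(state, z)` returns an action; `model_step(state, action)`
# returns (next_state, reward, ensemble_disagreement). Both are placeholders.
def plan_first_skill(state, skill_policy, model_step, skill_dim,
                     horizon=180, skill_repeat=3, population=400,
                     iterations=10, particles=20, temperature=0.01,
                     noise_std=1.0, alpha_thres=0.05, kappa=30.0):
    n_skills = horizon // skill_repeat                 # plan over skills, not raw actions
    mean = np.zeros((n_skills, skill_dim))
    for _ in range(iterations):                        # P planning iterations
        noise = noise_std * np.random.randn(population, n_skills, skill_dim)
        candidates = mean[None] + noise                # S candidate skill sequences
        returns = np.zeros(population)
        for i, seq in enumerate(candidates):
            total = 0.0
            for _ in range(particles):                 # expectation over model particles
                s = np.array(state, copy=True)
                for z in seq:
                    for _ in range(skill_repeat):      # repeat each skill for three steps
                        a = skill_policy(s, z)
                        s, reward, disagreement = model_step(s, a)
                        if disagreement > alpha_thres: # penalize epistemic uncertainty
                            reward -= kappa
                        total += reward
            returns[i] = total / particles
        # Exponentially weight candidates by return and refit the mean.
        weights = np.exp((returns - returns.max()) / temperature)
        mean = (weights[:, None, None] * candidates).sum(0) / weights.sum()
    return mean[0]  # execute the first skill; replan at every timestep
```

In this naive form the nested loops would be far too slow for a 180-step horizon; any practical implementation would batch the particle rollouts through the model, and the thresholded uncertainty penalty is only one assumed way to use α_thres and κ.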