Dynamics-Aware Unsupervised Discovery of Skills

Authors: Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, Karol Hausman

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through our experiments, we aim to demonstrate that: (a) DADS as a general-purpose skill discovery algorithm can scale to high-dimensional problems; (b) discovered skills are amenable to hierarchical composition; and (c) not only is planning in the learned latent space feasible, but it is competitive with strong baselines. In Section 6.1, we provide visualizations and qualitative analysis of the skills learned using DADS. We demonstrate in Section 6.2 and Section 6.4 that optimizing the primitives for predictability renders skills more amenable to temporal composition, which can be used for hierarchical RL. We benchmark against a state-of-the-art model-based RL baseline in Section 6.3, and against goal-conditioned RL in Section 6.5.
Researcher Affiliation | Industry | Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, Karol Hausman; Google Brain; {architsh,shanegu,slevine,vikashplus,karolhausman}@google.com
Pseudocode | Yes | Algorithm 1: Dynamics-Aware Discovery of Skills (DADS). A sketch of the intrinsic-reward computation at the core of this algorithm is given after the table.
Open Source Code | Yes | We have open-sourced our implementation at: https://github.com/google-research/dads
Open Datasets | Yes | We use the MuJoCo environments (Todorov et al., 2012) from OpenAI Gym as our test-bed (Brockman et al., 2016).
Dataset Splits | No | The paper describes unsupervised pre-training and testing on new tasks, but it does not specify explicit training/validation/test dataset splits with percentages or sample counts in the main text.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies | No | All of our models are written in the open-source TensorFlow-Agents library (Guadarrama et al., 2018), based on TensorFlow (Abadi et al., 2015). While the paper names these software packages, it does not specify version numbers for them or for other libraries.
Experiment Setup | Yes | We use SAC as the optimizer for our agent π(a | s, z), in particular EC-SAC (Haarnoja et al., 2018b). The hidden layer sizes vary from (128, 128) for Half-Cheetah to (512, 512) for Ant and (1024, 1024) for Humanoid. The critic Q(s, a, z) is parameterized similarly. The target critic is updated every iteration using soft updates with a coefficient of 0.005. We use the Adam optimizer (Kingma & Ba, 2014) with a fixed learning rate of 3e-4 and a fixed initial entropy coefficient β = 0.1. The episode horizon is generally kept shorter for stable agents like Ant (200) and longer for unstable agents like Humanoid (1000). The optimization scheme is on-policy, and we collect 2000 steps for Ant and 4000 steps for Humanoid in one iteration. The batch size is 128, and we carry out 32 steps of gradient descent. For continuous spaces, we set L = 500. After the intrinsic reward is computed, the policy and critic networks are updated for 128 steps with a batch size of 128. For evaluation, we fix the episode horizon to 200 for all models in all evaluation setups. For HP = 1, HZ = 10 and a 2D latent space, we use 50 samples from the planning distribution P. The coefficient γ for MPPI is fixed to 10. For sparse-reward navigation it is important to plan over a longer horizon, in which case we set HP = 4, HZ = 25 with a higher number of samples from the planning distribution (200 from P). For hierarchical controllers learned on top of the low-level unsupervised primitives, we use PPO (Schulman et al., 2017) for discrete-action skills and SAC for continuous skills. The meta-action is re-decided every 10 steps (that is, HZ = 10). The hidden layer sizes of the meta-controller are (128, 128). We use a learning rate of 1e-4 for PPO and 3e-4 for SAC. For our model-based RL baseline, PETS, we use an ensemble size of 3 with a fixed planning horizon of 20, and a model with two hidden layers of size 400. (The quoted hyperparameters and the latent-space planner they describe are summarized in the sketches after the table.)
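
The quoted evidence points to Algorithm 1 and to the choice L = 500 for continuous skill spaces. As a minimal, self-contained sketch of the quantity at the core of that algorithm, the snippet below approximates the DADS intrinsic reward r(s, z, s') = log q(s'|s, z) - log((1/L) * sum_i q(s'|s, z_i)), with the z_i drawn from the skill prior. The Gaussian skill-dynamics stand-in, its parameters A, B, SIGMA, and the toy dimensions are our placeholders, not the paper's learned network.

```python
# Minimal sketch of the DADS intrinsic reward, assuming a toy Gaussian
# skill-dynamics model; the paper uses a learned neural network instead.
import numpy as np

STATE_DIM, SKILL_DIM, L = 4, 2, 500   # L = 500 prior samples, as quoted above
rng = np.random.default_rng(0)

# Stand-in "skill dynamics" q(s'|s, z): s' ~ N(s + A s + B z, SIGMA^2 I).
A = 0.05 * rng.standard_normal((STATE_DIM, STATE_DIM))
B = 0.50 * rng.standard_normal((STATE_DIM, SKILL_DIM))
SIGMA = 0.1

def log_q(s_next, s, z):
    """Log-density of s' under the stand-in skill-dynamics model."""
    mean = s + s @ A.T + z @ B.T
    diff = s_next - mean
    return (-0.5 * np.sum(diff ** 2, axis=-1) / SIGMA ** 2
            - 0.5 * STATE_DIM * np.log(2 * np.pi * SIGMA ** 2))

def dads_intrinsic_reward(s, z, s_next):
    """r(s, z, s') = log q(s'|s, z) - log((1/L) * sum_i q(s'|s, z_i)), z_i ~ p(z)."""
    log_num = log_q(s_next, s, z)                                # (batch,)
    z_prior = rng.uniform(-1.0, 1.0, size=(L, SKILL_DIM))        # continuous skill prior
    log_alt = log_q(s_next[None], s[None], z_prior[:, None, :])  # (L, batch)
    log_denom = np.logaddexp.reduce(log_alt, axis=0) - np.log(L)
    return log_num - log_denom

# Transitions consistent with the commanded skill should score higher than random ones.
s = rng.standard_normal((8, STATE_DIM))
z = rng.uniform(-1.0, 1.0, (8, SKILL_DIM))
s_consistent = s + s @ A.T + z @ B.T + 0.05 * rng.standard_normal((8, STATE_DIM))
s_random = s + rng.standard_normal((8, STATE_DIM))
print(dads_intrinsic_reward(s, z, s_consistent).mean(),
      dads_intrinsic_reward(s, z, s_random).mean())
```

In the method described by the paper, this reward replaces the task reward in the SAC update for π(a | s, z), while the skill-dynamics model itself is fit by maximum likelihood on the collected transitions.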
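
For quick reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration. The grouping and key names below are ours and purely illustrative; they are not identifiers from the released code.

```python
# Hedged summary of the quoted DADS hyperparameters; key names are illustrative.
DADS_HYPERPARAMETERS = {
    "agent": {
        "algorithm": "EC-SAC",
        "optimizer": "Adam",
        "learning_rate": 3e-4,
        "initial_entropy_coefficient": 0.1,        # beta
        "target_critic_soft_update": 0.005,
        "policy_hidden_sizes": {"HalfCheetah": (128, 128),
                                "Ant": (512, 512),
                                "Humanoid": (1024, 1024)},
        "batch_size": 128,
        "gradient_steps_per_iteration": 32,
        "policy_and_critic_steps_after_intrinsic_reward": 128,
        "steps_collected_per_iteration": {"Ant": 2000, "Humanoid": 4000},
        "episode_horizon": {"Ant": 200, "Humanoid": 1000},
        "prior_samples_L_continuous": 500,
    },
    "planning": {
        "default": {"H_P": 1, "H_Z": 10, "num_plans": 50, "mppi_gamma": 10},
        "sparse_reward_navigation": {"H_P": 4, "H_Z": 25, "num_plans": 200},
        "evaluation_episode_horizon": 200,
    },
    "hierarchical_controller": {
        "discrete_skills": {"algorithm": "PPO", "learning_rate": 1e-4},
        "continuous_skills": {"algorithm": "SAC", "learning_rate": 3e-4},
        "meta_action_every_n_steps": 10,
        "hidden_sizes": (128, 128),
    },
    "pets_baseline": {"ensemble_size": 3, "planning_horizon": 20,
                      "model_hidden_sizes": (400, 400)},
}
```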
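
The planning hyperparameters above (H_P plan steps, each skill held for H_Z environment steps, 50 or 200 sampled plans, MPPI coefficient γ = 10) correspond to model-predictive control in the learned latent space. The sketch below shows one MPPI refinement under those settings; the toy skill-dynamics step, the reward function, the noise scale, and the number of refinement iterations are our placeholders, not the paper's.

```python
# Minimal MPPI-in-latent-space sketch, assuming a toy deterministic skill-dynamics
# step and a toy reward; only the H_P / H_Z / sample-count / gamma values come
# from the quoted setup.
import numpy as np

rng = np.random.default_rng(1)
SKILL_DIM = 2
H_P, H_Z, N_PLANS, GAMMA = 1, 10, 50, 10.0   # quoted defaults

def rollout_return(s0, plan, dynamics_step, reward_fn):
    """Hold each of the H_P skills for H_Z steps under the skill dynamics
    and accumulate the task reward along the predicted trajectory."""
    s, total = s0, 0.0
    for z in plan:                                  # plan has shape (H_P, SKILL_DIM)
        for _ in range(H_Z):
            s = dynamics_step(s, z)
            total += reward_fn(s)
    return total

def mppi_plan(s0, dynamics_step, reward_fn, iters=3, noise=0.5):
    """Refine a latent plan with MPPI: sample plans around the mean, weight each
    plan by exp(GAMMA * return), and re-fit the mean to the weighted average."""
    mean = np.zeros((H_P, SKILL_DIM))
    for _ in range(iters):
        eps = noise * rng.standard_normal((N_PLANS, H_P, SKILL_DIM))
        plans = np.clip(mean[None] + eps, -1.0, 1.0)            # skills live in [-1, 1]
        returns = np.array([rollout_return(s0, p, dynamics_step, reward_fn)
                            for p in plans])
        weights = np.exp(GAMMA * (returns - returns.max()))
        weights /= weights.sum()
        mean = np.tensordot(weights, plans, axes=1)             # weighted average plan
    return mean[0]                                              # execute first skill, then re-plan

# Toy usage: the skill pushes the state in the direction it encodes; reward is x-progress.
toy_step = lambda s, z: s + 0.1 * np.concatenate([z, np.zeros(s.size - z.size)])
toy_reward = lambda s: s[0]
print("first skill to execute:", mppi_plan(np.zeros(4), toy_step, toy_reward))
```

In the paper's setup, dynamics_step would be the learned skill-dynamics model q(s'|s, z), and the plan is refreshed in an MPC loop after each executed skill.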