Dynamics-Aware Unsupervised Discovery of Skills

Authors: Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, Karol Hausman

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through our experiments, we aim to demonstrate that: (a) DADS as a general-purpose skill discovery algorithm can scale to high-dimensional problems; (b) discovered skills are amenable to hierarchical composition; and (c) not only is planning in the learned latent space feasible, but it is competitive with strong baselines. In Section 6.1, we provide visualizations and qualitative analysis of the skills learned using DADS. We demonstrate in Section 6.2 and Section 6.4 that optimizing the primitives for predictability renders skills more amenable to temporal composition, which can be used for hierarchical RL. We benchmark against a state-of-the-art model-based RL baseline in Section 6.3, and against goal-conditioned RL in Section 6.5.
Researcher Affiliation | Industry | Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, Karol Hausman; Google Brain; {architsh,shanegu,slevine,vikashplus,karolhausman}@google.com
Pseudocode | Yes | Algorithm 1: Dynamics-Aware Discovery of Skills (DADS). A sketch of the intrinsic-reward computation at the core of this algorithm is given after the table.
Open Source Code | Yes | We have open-sourced our implementation at: https://github.com/google-research/dads
Open Datasets | Yes | We use the MuJoCo environments (Todorov et al., 2012) from OpenAI Gym as our test-bed (Brockman et al., 2016).
Dataset Splits | No | The paper describes unsupervised pre-training and testing on new tasks, but it does not specify explicit training/validation/test dataset splits with percentages or sample counts in the main text.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies | No | All of our models are written in the open-source TensorFlow-Agents library (Guadarrama et al., 2018), based on TensorFlow (Abadi et al., 2015). While the paper names these software packages, it does not specify version numbers for them or for other libraries.
Experiment Setup | Yes | We use SAC as the optimizer for our agent π(a | s, z), in particular EC-SAC (Haarnoja et al., 2018b). The hidden layer sizes vary from (128, 128) for Half-Cheetah to (512, 512) for Ant and (1024, 1024) for Humanoid. The critic Q(s, a, z) is parameterized similarly. The target critic is updated every iteration using soft updates with a coefficient of 0.005. We use the Adam optimizer (Kingma & Ba, 2014) with a fixed learning rate of 3e-4 and a fixed initial entropy coefficient β = 0.1. The episode horizon is generally kept shorter for stable agents like Ant (200) and longer for unstable agents like Humanoid (1000). The optimization scheme is on-policy, and we collect 2000 steps for Ant and 4000 steps for Humanoid in one iteration. The batch size is 128, and we carry out 32 steps of gradient descent. For continuous spaces, we set L = 500. After the intrinsic reward is computed, the policy and critic networks are updated for 128 steps with a batch size of 128. For evaluation, we fix the episode horizon to 200 for all models in all evaluation setups. For HP = 1, HZ = 10 and a 2D latent space, we use 50 samples from the planning distribution P. The coefficient γ for MPPI is fixed to 10. For sparse-reward navigation it is important to plan over a longer horizon, in which case we set HP = 4, HZ = 25 with a higher number of samples from the planning distribution (200 from P). For hierarchical controllers learned on top of the low-level unsupervised primitives, we use PPO (Schulman et al., 2017) for discrete-action skills and SAC for continuous skills. The meta-action is re-decided every 10 steps (that is, HZ = 10). The hidden layer sizes of the meta-controller are (128, 128). We use a learning rate of 1e-4 for PPO and 3e-4 for SAC. For our model-based RL baseline, PETS, we use an ensemble size of 3 with a fixed planning horizon of 20, and a model with two hidden layers of size 400. (The quoted hyperparameters and the latent-space planner they describe are summarized in the sketches after the table.)
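
The quoted evidence points to Algorithm 1 and to the choice L = 500 for continuous skill spaces. As a minimal, self-contained sketch of the quantity at the core of that algorithm, the snippet below approximates the DADS intrinsic reward r(s, z, s') = log q(s'|s, z) - log((1/L) * sum_i q(s'|s, z_i)), with the z_i drawn from the skill prior. The Gaussian skill-dynamics stand-in, its parameters A, B, SIGMA, and the toy dimensions are our placeholders, not the paper's learned network.

```python
# Minimal sketch of the DADS intrinsic reward, assuming a toy Gaussian
# skill-dynamics model; the paper uses a learned neural network instead.
import numpy as np

STATE_DIM, SKILL_DIM, L = 4, 2, 500   # L = 500 prior samples, as quoted above
rng = np.random.default_rng(0)

# Stand-in "skill dynamics" q(s'|s, z): s' ~ N(s + A s + B z, SIGMA^2 I).
A = 0.05 * rng.standard_normal((STATE_DIM, STATE_DIM))
B = 0.50 * rng.standard_normal((STATE_DIM, SKILL_DIM))
SIGMA = 0.1

def log_q(s_next, s, z):
    """Log-density of s' under the stand-in skill-dynamics model."""
    mean = s + s @ A.T + z @ B.T
    diff = s_next - mean
    return (-0.5 * np.sum(diff ** 2, axis=-1) / SIGMA ** 2
            - 0.5 * STATE_DIM * np.log(2 * np.pi * SIGMA ** 2))

def dads_intrinsic_reward(s, z, s_next):
    """r(s, z, s') = log q(s'|s, z) - log((1/L) * sum_i q(s'|s, z_i)), z_i ~ p(z)."""
    log_num = log_q(s_next, s, z)                                # (batch,)
    z_prior = rng.uniform(-1.0, 1.0, size=(L, SKILL_DIM))        # continuous skill prior
    log_alt = log_q(s_next[None], s[None], z_prior[:, None, :])  # (L, batch)
    log_denom = np.logaddexp.reduce(log_alt, axis=0) - np.log(L)
    return log_num - log_denom

# Transitions consistent with the commanded skill should score higher than random ones.
s = rng.standard_normal((8, STATE_DIM))
z = rng.uniform(-1.0, 1.0, (8, SKILL_DIM))
s_consistent = s + s @ A.T + z @ B.T + 0.05 * rng.standard_normal((8, STATE_DIM))
s_random = s + rng.standard_normal((8, STATE_DIM))
print(dads_intrinsic_reward(s, z, s_consistent).mean(),
      dads_intrinsic_reward(s, z, s_random).mean())
```

In the method described by the paper, this reward replaces the task reward in the SAC update for π(a | s, z), while the skill-dynamics model itself is fit by maximum likelihood on the collected transitions.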
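
For quick reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration. The grouping and key names below are ours and purely illustrative; they are not identifiers from the released code.

```python
# Hedged summary of the quoted DADS hyperparameters; key names are illustrative.
DADS_HYPERPARAMETERS = {
    "agent": {
        "algorithm": "EC-SAC",
        "optimizer": "Adam",
        "learning_rate": 3e-4,
        "initial_entropy_coefficient": 0.1,        # beta
        "target_critic_soft_update": 0.005,
        "policy_hidden_sizes": {"HalfCheetah": (128, 128),
                                "Ant": (512, 512),
                                "Humanoid": (1024, 1024)},
        "batch_size": 128,
        "gradient_steps_per_iteration": 32,
        "policy_and_critic_steps_after_intrinsic_reward": 128,
        "steps_collected_per_iteration": {"Ant": 2000, "Humanoid": 4000},
        "episode_horizon": {"Ant": 200, "Humanoid": 1000},
        "prior_samples_L_continuous": 500,
    },
    "planning": {
        "default": {"H_P": 1, "H_Z": 10, "num_plans": 50, "mppi_gamma": 10},
        "sparse_reward_navigation": {"H_P": 4, "H_Z": 25, "num_plans": 200},
        "evaluation_episode_horizon": 200,
    },
    "hierarchical_controller": {
        "discrete_skills": {"algorithm": "PPO", "learning_rate": 1e-4},
        "continuous_skills": {"algorithm": "SAC", "learning_rate": 3e-4},
        "meta_action_every_n_steps": 10,
        "hidden_sizes": (128, 128),
    },
    "pets_baseline": {"ensemble_size": 3, "planning_horizon": 20,
                      "model_hidden_sizes": (400, 400)},
}
```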
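
The planning hyperparameters above (H_P plan steps, each skill held for H_Z environment steps, 50 or 200 sampled plans, MPPI coefficient γ = 10) correspond to model-predictive control in the learned latent space. The sketch below shows one MPPI refinement under those settings; the toy skill-dynamics step, the reward function, the noise scale, and the number of refinement iterations are our placeholders, not the paper's.

```python
# Minimal MPPI-in-latent-space sketch, assuming a toy deterministic skill-dynamics
# step and a toy reward; only the H_P / H_Z / sample-count / gamma values come
# from the quoted setup.
import numpy as np

rng = np.random.default_rng(1)
SKILL_DIM = 2
H_P, H_Z, N_PLANS, GAMMA = 1, 10, 50, 10.0   # quoted defaults

def rollout_return(s0, plan, dynamics_step, reward_fn):
    """Hold each of the H_P skills for H_Z steps under the skill dynamics
    and accumulate the task reward along the predicted trajectory."""
    s, total = s0, 0.0
    for z in plan:                                  # plan has shape (H_P, SKILL_DIM)
        for _ in range(H_Z):
            s = dynamics_step(s, z)
            total += reward_fn(s)
    return total

def mppi_plan(s0, dynamics_step, reward_fn, iters=3, noise=0.5):
    """Refine a latent plan with MPPI: sample plans around the mean, weight each
    plan by exp(GAMMA * return), and re-fit the mean to the weighted average."""
    mean = np.zeros((H_P, SKILL_DIM))
    for _ in range(iters):
        eps = noise * rng.standard_normal((N_PLANS, H_P, SKILL_DIM))
        plans = np.clip(mean[None] + eps, -1.0, 1.0)            # skills live in [-1, 1]
        returns = np.array([rollout_return(s0, p, dynamics_step, reward_fn)
                            for p in plans])
        weights = np.exp(GAMMA * (returns - returns.max()))
        weights /= weights.sum()
        mean = np.tensordot(weights, plans, axes=1)             # weighted average plan
    return mean[0]                                              # execute first skill, then re-plan

# Toy usage: the skill pushes the state in the direction it encodes; reward is x-progress.
toy_step = lambda s, z: s + 0.1 * np.concatenate([z, np.zeros(s.size - z.size)])
toy_reward = lambda s: s[0]
print("first skill to execute:", mppi_plan(np.zeros(4), toy_step, toy_reward))
```

In the paper's setup, dynamics_step would be the learned skill-dynamics model q(s'|s, z), and the plan is refreshed in an MPC loop after each executed skill.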