Lipschitz-constrained Unsupervised Skill Discovery

Authors: Seohong Park, Jongwook Choi, Jaekyeom Kim, Honglak Lee, Gunhee Kim

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through experiments on various MuJoCo robotic locomotion and manipulation environments, we demonstrate that LSD outperforms previous approaches in terms of skill diversity, state space coverage, and performance on seven downstream tasks including the challenging task of following multiple goals on Humanoid.
Researcher Affiliation | Collaboration | Seohong Park (1), Jongwook Choi (2), Jaekyeom Kim (1), Honglak Lee (2,3), Gunhee Kim (1). Affiliations: (1) Seoul National University, {artberryx,jaekyeom,gunhee}@snu.ac.kr; (2) University of Michigan, {jwook,honglak}@umich.edu; (3) LG AI Research.
Pseudocode | Yes | Algorithm 1: Lipschitz-constrained Skill Discovery (LSD). (A hedged code sketch of this algorithm follows the table.)
Open Source Code | Yes | Our code and videos are available at https://shpark.me/projects/lsd/. ... We make our implementation publicly available in the repository at https://vision.snu.ac.kr/projects/lsd/ and provide the full implementation details in Appendix I.
Open Datasets | Yes | We compare LSD with multiple previous skill discovery methods on various MuJoCo robotic locomotion and manipulation environments (Todorov et al., 2012; Schulman et al., 2016; Plappert et al., 2018) from OpenAI Gym (Brockman et al., 2016). (Illustrative environment IDs are sketched after the table.)
Dataset Splits | No | The paper evaluates performance within continuous simulation environments and does not describe fixed train/validation/test dataset splits in the traditional sense.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | The paper mentions software like the 'garage' framework, 'SAC (Haarnoja et al., 2018a)', and 'PPO (Schulman et al., 2017)', but does not provide specific version numbers for these software dependencies, which are necessary for full reproducibility.
Experiment Setup | Yes | We model each trainable component as a two-layered MLP with 1024 units, and train them with SAC (Haarnoja et al., 2018a). At every epoch, we sample 20 (Ant and HalfCheetah) or 5 (Humanoid) rollouts and train the networks with 4 gradient steps computed from 2048-sized mini-batches. ... For each method, we search the discount factor γ from {0.99, 0.995} and the SAC entropy coefficient α from {0.003, 0.01, 0.03, 0.1, 0.3, 1.0, auto-adjust ...}. We set the default learning rate to 1e-4, but 3e-5 for DADS's q, DIAYN's q and LSD's φ, and 3e-4 for IBOL's low-level policy. ... For downstream tasks, we train a high-level meta-controller on top of a pre-trained skill policy with SAC ... or PPO ... The meta-controller is modeled as an MLP with two hidden layers of 512 dimensions. We set K to 25 (Ant and HalfCheetah) or 125 (Humanoid), the learning rate to 3e-4, the discount factor to 0.995, and use an auto-adjusted entropy coefficient (SAC) or an entropy coefficient of 0.01 (PPO). (These settings are collected into a configuration sketch after the table.)
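
Code sketch for the Pseudocode row: Algorithm 1 is not reproduced here, but the core of LSD can be summarized from the paper's description: a state representation φ is kept 1-Lipschitz (implemented with spectral normalization) and the skill policy earns intrinsic reward for moving φ(s) along the sampled skill vector z. The PyTorch snippet below is a minimal illustrative sketch under those assumptions, not the authors' implementation; the class and function names and the placeholder dimensions are our own, and the two-layer 1024-unit size is taken from the experiment setup row.

    import torch
    import torch.nn as nn

    # Sketch of LSD's intrinsic reward. phi is kept 1-Lipschitz by spectral-normalizing
    # each linear layer (ReLU is itself 1-Lipschitz, so the composition is 1-Lipschitz).
    class Phi(nn.Module):
        def __init__(self, state_dim, skill_dim, hidden=1024):
            super().__init__()
            self.net = nn.Sequential(
                nn.utils.spectral_norm(nn.Linear(state_dim, hidden)), nn.ReLU(),
                nn.utils.spectral_norm(nn.Linear(hidden, hidden)), nn.ReLU(),
                nn.utils.spectral_norm(nn.Linear(hidden, skill_dim)),
            )

        def forward(self, s):
            return self.net(s)

    def lsd_intrinsic_reward(phi, s, s_next, z):
        # r(s, z, s') = (phi(s') - phi(s))^T z, with continuous skills z ~ N(0, I)
        return ((phi(s_next) - phi(s)) * z).sum(dim=-1)

    phi = Phi(state_dim=30, skill_dim=2)  # placeholder dimensions
    opt = torch.optim.Adam(phi.parameters(), lr=3e-5)  # lr for LSD's φ per the setup row

    def phi_update(batch):
        # batch: dict of tensors "s", "s_next", "z" sampled from skill-policy rollouts
        loss = -lsd_intrinsic_reward(phi, batch["s"], batch["s_next"], batch["z"]).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

In the paper, the skill policy π(a | s, z) is trained with SAC on this intrinsic reward, alternating with updates to φ.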
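
Environment sketch for the Open Datasets row: the benchmark environments can be instantiated through the standard Gym API. The environment IDs and version suffixes below are assumptions for illustration; the excerpt does not state which versions the paper uses.

    import gym

    # Hypothetical Gym IDs for the locomotion and manipulation tasks mentioned above.
    LOCOMOTION = ["Ant-v3", "HalfCheetah-v3", "Humanoid-v3"]
    MANIPULATION = ["FetchPush-v1", "FetchSlide-v1", "FetchPickAndPlace-v1"]

    for env_id in LOCOMOTION + MANIPULATION:
        env = gym.make(env_id)
        env.reset()
        print(env_id, env.observation_space, env.action_space)
        env.close()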
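
Configuration sketch for the Experiment Setup row: the quoted hyperparameters are collected into one Python structure below. The values come from the excerpt; the dictionary layout and key names are our own and not part of the paper.

    # Illustrative summary only; key names are hypothetical.
    SKILL_PRETRAINING = {
        "network": {"type": "MLP", "hidden_layers": 2, "hidden_units": 1024},
        "algorithm": "SAC",
        "rollouts_per_epoch": {"Ant": 20, "HalfCheetah": 20, "Humanoid": 5},
        "gradient_steps_per_epoch": 4,
        "minibatch_size": 2048,
        "gamma_search": [0.99, 0.995],
        "sac_alpha_search": [0.003, 0.01, 0.03, 0.1, 0.3, 1.0, "auto-adjust"],
        "learning_rate": {
            "default": 1e-4,
            "DADS_q, DIAYN_q, LSD_phi": 3e-5,
            "IBOL_low_level_policy": 3e-4,
        },
    }

    DOWNSTREAM = {
        "meta_controller": {"type": "MLP", "hidden_layers": 2, "hidden_units": 512},
        "algorithm": ["SAC", "PPO"],
        "K": {"Ant": 25, "HalfCheetah": 25, "Humanoid": 125},  # K as referred to in the quote
        "learning_rate": 3e-4,
        "gamma": 0.995,
        "entropy_coefficient": {"SAC": "auto-adjust", "PPO": 0.01},
    }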