Exploration by Learning Diverse Skills through Successor State Representations

Authors: Paul-Antoine LE TOLGUENEC, Yann BESSE, Florent Teichteil-Koenigsbuch, Dennis Wilson, Emmanuel Rachelson

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate our approach on a set of maze navigation and robotic control tasks which show that our method is capable of constructing a diverse set of skills which exhaustively cover the state space without relying on reward or exploration bonuses.
Researcher Affiliation Collaboration
  Paul-Antoine Le Tolguenec, ISAE-Supaero and Airbus, paul-antoine.le-tolguenec@airbus.com
  Yann Besse, Airbus, yann.besse@airbus.com
  Florent Teichteil-Konigsbuch, Airbus, florent.teichteil-konigsbuch@airbus.com
  Dennis G. Wilson, ISAE-Supaero, Université de Toulouse, dennis.wilson@isae-supaero.fr
  Emmanuel Rachelson, ISAE-Supaero, Université de Toulouse, emmanuel.rachelson@isae-supaero.fr
Pseudocode Yes Algorithm 1 LEADS
  Initialize θ_0
  for t ∈ [0, N] do
    # Collect samples
    D_z = ∅, ∀ z ∈ Z
    for e ∈ [1, n_ep] do
      Sample skill z ∼ p(z)
      {(s_t, a_t, r_t, s'_t)} = episode with π_{θ_t}(·, z) from s_0
      D_z = D_z ∪ {(s_t, a_t, r_t, s'_t)}
    end for
    # Learn the SSR
    Learn m_{ϕ_t} for π_{θ_t} using on-policy C-learning
    Sample s ∼ δ(s|z)
    # Improve θ
    for i ∈ [1, n_SGD] do
      Sample z ∼ p(z), s_1 ∼ p(s|z)
      θ ← θ + α ∇_θ [G(θ) + λ_h H(θ)]
      Update ϕ_t using off-policy C-learning
    end for
  end for
Open Source Code Yes We provide all code for LEADS and the baseline algorithms, as well as the scripts to reproduce the experiments (repository).
Open Datasets Yes We evaluate LEADS on a variety of MuJoCo [42] environments from different benchmark suites. Fetch-Reach [37] is a 7-DoF (degrees of freedom) robotic arm equipped with a two-fingered parallel gripper; its observation space is 10-dimensional. Fetch-Slide extends the former with a puck placed on a platform in front of the arm, increasing the observation space dimension to 25. Hand [37] is a 24-DoF anthropomorphic robotic hand, with a 63-dimensional observation space. Finger [44] is a 3-DoF manipulation environment with a 12-dimensional observation space, in which a planar finger is required to rotate an object on an unactuated hinge.
Dataset Splits No The paper describes training on various reinforcement learning environments and evaluates performance across tasks, but does not specify explicit train/validation/test dataset splits in terms of percentages, counts, or predefined partition files.
Hardware Specification No This work was performed using HPC resources from CALMIP (Grant 2016-[p21001]).
Software Dependencies No The paper mentions software like MuJoCo [42] and the Gymnasium suite [43], but it does not provide specific version numbers for these or any other key software dependencies required to replicate the experiment.
Experiment Setup Yes The following table (Table 3) summarizes the hyperparameters used in our experimental setup.
  Hyperparameter              Value
  n_skill                     6
  z_dim                       20
  λ_h                         0.05
  γ                           0.95
  λ_c-learning                0.5
  α_θ                         5 × 10^-4
  α_c-learning                5 × 10^-4
  n_episode                   16
  n_SGD, c-learning           256
  n_SGD, actor                16
  n_archive                   1
  batch size_c-learning       1024
  batch size_loss             1024
  Table 3: Hyperparameters used for LEADS
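
As an illustration only, the hyperparameters of Table 3 and the sample-collection phase of Algorithm 1 can be sketched in Python. This is not the authors' code. The names LeadsConfig, toy_policy, toy_step, collect_samples and HORIZON are hypothetical, the 1-D environment is a stand-in for the MuJoCo tasks, p(z) is assumed to be uniform over skill indices, and the two learning rates are read as 5e-4 on the assumption that the exponent sign was lost in extraction. The SSR learning and policy-improvement steps are omitted because G(θ) and H(θ) are not defined in this summary.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LeadsConfig:
    """Values copied from Table 3; field names are hypothetical."""
    n_skill: int = 6
    z_dim: int = 20
    lambda_h: float = 0.05
    gamma: float = 0.95
    lambda_c_learning: float = 0.5
    lr_actor: float = 5e-4        # alpha_theta, assuming 5 x 10^-4
    lr_c_learning: float = 5e-4   # alpha_c-learning, assuming 5 x 10^-4
    n_episode: int = 16
    n_sgd_c_learning: int = 256
    n_sgd_actor: int = 16
    n_archive: int = 1
    batch_size_c_learning: int = 1024
    batch_size_loss: int = 1024


HORIZON = 10  # hypothetical episode length, not given in the source


def toy_policy(state, z, rng):
    """Placeholder for pi_theta(. , z): a random step biased by the skill index."""
    return rng.choice([-1, 1]) + (1 if z % 2 == 0 else -1)


def toy_step(state, action):
    """Hypothetical 1-D environment: the state is an integer position."""
    reward = 0.0  # the method does not rely on reward; kept to fill the tuple
    return state + action, reward


def collect_samples(config, horizon=HORIZON, seed=0):
    """Collection phase of Algorithm 1: one dataset D_z per skill."""
    rng = random.Random(seed)
    datasets = defaultdict(list)              # D_z starts empty for every z
    for _ in range(config.n_episode):
        z = rng.randrange(config.n_skill)     # z ~ p(z), assumed uniform
        state = 0                             # every episode starts from s_0
        for _ in range(horizon):
            action = toy_policy(state, z, rng)
            next_state, reward = toy_step(state, action)
            datasets[z].append((state, action, reward, next_state))
            state = next_state
    return datasets


config = LeadsConfig()
datasets = collect_samples(config)
```

Keeping one buffer per skill mirrors the D_z notation of the algorithm, so a later step could sample states conditioned on z directly from datasets[z].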