Exploration by Learning Diverse Skills through Successor State Representations
Authors: Paul-Antoine Le Tolguenec, Yann Besse, Florent Teichteil-Koenigsbuch, Dennis Wilson, Emmanuel Rachelson
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our approach on a set of maze navigation and robotic control tasks which show that our method is capable of constructing a diverse set of skills which exhaustively cover the state space without relying on reward or exploration bonuses. |
| Researcher Affiliation | Collaboration | Paul-Antoine Le Tolguenec (ISAE-Supaero, Airbus), paul-antoine.le-tolguenec@airbus.com; Yann Besse (Airbus), yann.besse@airbus.com; Florent Teichteil-Konigsbuch (Airbus), florent.teichteil-konigsbuch@airbus.com; Dennis G. Wilson (ISAE-Supaero, Université de Toulouse), dennis.wilson@isae-supaero.fr; Emmanuel Rachelson (ISAE-Supaero, Université de Toulouse), emmanuel.rachelson@isae-supaero.fr |
| Pseudocode | Yes | Algorithm 1 LEADS<br>Initialize θ_0<br>for t ∈ [0, N] do<br># Collect samples<br>D_z = ∅, ∀z ∈ Z<br>for e ∈ [1, n_ep] do<br>Sample skill z ∼ p(z)<br>{(s_t, a_t, r_t, s′_t)} = episode with π_θt(·, z) from s_0<br>D_z = D_z ∪ {(s_t, a_t, r_t, s′_t)}<br>end for<br># Learn the SSR<br>Learn m_ϕt for π_θt using on-policy C-learning<br>Sample s ∼ δ(s\|z)<br># Improve θ<br>for i ∈ [1, n_SGD] do<br>Sample z ∼ p(z), s_1 ∼ p(s\|z)<br>θ ← θ + α ∇_θ[G(θ) + λ_h H(θ)]<br>Update ϕ_t using off-policy C-learning<br>end for<br>end for |
| Open Source Code | Yes | We provide all code for LEADS and the baseline algorithms, as well as the scripts to reproduce the experiments (repository). |
| Open Datasets | Yes | We evaluate LEADS on a variety of MuJoCo [42] environments from different benchmark suites. Fetch-Reach [37] is a 7-DoF (degrees of freedom) robotic arm equipped with a two-fingered parallel gripper; its observation space is 10-dimensional. Fetch-Slide extends the former with a puck placed on a platform in front of the arm, increasing the observation space dimension to 25. Hand [37] is a 24-DoF anthropomorphic robotic hand, with a 63-dimensional observation space. Finger [44] is a 3-DoF manipulation environment with a 12-dimensional observation space, where a planar finger is required to rotate an object on an unactuated hinge. |
| Dataset Splits | No | The paper describes training on various reinforcement learning environments and evaluates performance across tasks, but does not specify explicit train/validation/test dataset splits in terms of percentages, counts, or predefined partition files. |
| Hardware Specification | No | This work was performed using HPC resources from CALMIP (Grant 2016-[p21001]). |
| Software Dependencies | No | The paper mentions software such as MuJoCo [42] and the Gymnasium suite [43], but it does not provide version numbers for these or for any other key software dependency required to replicate the experiments. |
| Experiment Setup | Yes | The following table (Table 3) summarizes the hyperparameters used in our experimental setup. Table 3: Hyperparameters used for LEADS (hyperparameter = value): n_skill = 6; z_dim = 20; λ_h = 0.05; γ = 0.95; λ_c-learning = 0.5; α_θ = 5 × 10⁻⁴; α_c-learning = 5 × 10⁻⁴; n_episode = 16; n_SGD, c-learning = 256; n_SGD, actor = 16; n_archive = 1; batch size (c-learning) = 1024; batch size (loss) = 1024. |
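
The Table 3 values and the control flow of Algorithm 1 can be collected into a small Python sketch. This is only an illustration under stated assumptions, not the authors' implementation: `LeadsConfig`, `leads_outer_loop`, and every callable passed to it are hypothetical names; the learning rates read the extracted "5 10 4" as 5 × 10⁻⁴; p(z) is assumed uniform over `n_skill` discrete skills; n_ep and n_SGD in the pseudocode are assumed to correspond to `n_episode` and `n_sgd_actor`; and the target-state sampling step (s ∼ δ(s|z)) is left to the caller-supplied `actor_step`.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Transition = Tuple[object, object, float, object]  # (s, a, r, s')
Buffers = Dict[int, List[Transition]]


@dataclass(frozen=True)
class LeadsConfig:
    """Hyperparameters as listed in Table 3 (field names are hypothetical)."""
    n_skill: int = 6
    z_dim: int = 20
    lambda_h: float = 0.05
    gamma: float = 0.95
    lambda_c_learning: float = 0.5
    alpha_theta: float = 5e-4        # assumption: "5 10 4" means 5e-4
    alpha_c_learning: float = 5e-4   # same assumption
    n_episode: int = 16
    n_sgd_c_learning: int = 256
    n_sgd_actor: int = 16
    n_archive: int = 1
    batch_size_c_learning: int = 1024
    batch_size_loss: int = 1024


def leads_outer_loop(
    cfg: LeadsConfig,
    n_iterations: int,
    rollout: Callable[[int], List[Transition]],
    fit_ssr_on_policy: Callable[[Buffers], None],
    actor_step: Callable[[int, Buffers], None],
    update_ssr_off_policy: Callable[[Buffers], None],
    rng: random.Random,
) -> Buffers:
    """Mirror the loop structure of Algorithm 1; all learning is delegated."""
    buffers: Buffers = {}
    for _iteration in range(n_iterations):
        # Collect samples: one fresh buffer D_z per skill.
        buffers = {z: [] for z in range(cfg.n_skill)}
        for _episode in range(cfg.n_episode):
            z = rng.randrange(cfg.n_skill)  # assumed uniform p(z)
            buffers[z].extend(rollout(z))
        # Learn the SSR for the current policy.
        fit_ssr_on_policy(buffers)
        # Improve the policy parameters, refreshing the SSR after each step.
        for _step in range(cfg.n_sgd_actor):
            z = rng.randrange(cfg.n_skill)
            actor_step(z, buffers)
            update_ssr_off_policy(buffers)
    return buffers
```

Passing the rollout and update steps in as callables keeps the sketch independent of any particular environment or network library, so the loop structure can be checked with simple stubs before real components are plugged in.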