Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Exploration by Learning Diverse Skills through Successor State Representations

Authors: Paul-Antoine Le Tolguenec, Yann Besse, Florent Teichteil-Koenigsbuch, Dennis Wilson, Emmanuel Rachelson

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate our approach on a set of maze navigation and robotic control tasks which show that our method is capable of constructing a diverse set of skills which exhaustively cover the state space without relying on reward or exploration bonuses.
Researcher Affiliation Collaboration Paul-Antoine Le Tolguenec (ISAE-Supaero, Airbus); Yann Besse (Airbus); Florent Teichteil-Koenigsbuch (Airbus); Dennis G. Wilson (ISAE-Supaero, Université de Toulouse); Emmanuel Rachelson (ISAE-Supaero, Université de Toulouse)
Pseudocode Yes
Algorithm 1 LEADS
  Initialize θ0
  for t ∈ [0, N] do
      # Collect samples
      Dz = ∅, ∀z ∈ Z
      for e ∈ [1, n_ep] do
          Sample skill z ∼ p(z)
          {(s_t, a_t, r_t, s′_t)} = episode with πθt(·, z) from s0
          Dz = Dz ∪ {(s_t, a_t, r_t, s′_t)}
      end for
      # Learn the SSR
      Learn mϕt for πθt using on-policy C-learning
      Sample s ∼ δ(s|z)
      # Improve θ
      for i ∈ [1, n_SGD] do
          Sample z ∼ p(z), s1 ∼ p(s|z)
          θ ← θ + α ∇θ[G(θ) + λ_h H(θ)]
          Update ϕt using off-policy C-learning
      end for
  end for
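The control flow of Algorithm 1 can be sketched as follows. This is a minimal, hedged illustration: the policy, the successor-state model m_ϕ, and both C-learning updates are stand-in stubs (all names here are hypothetical, not the authors' released code); only the loop structure mirrors the pseudocode.

```python
# Sketch of the LEADS outer training loop (Algorithm 1).
# Policy parameters, rollouts, and C-learning updates are stubs.
import random

def leads_sketch(n_outer=2, n_ep=4, n_sgd=3, n_skills=6):
    skills = list(range(n_skills))
    theta = 0          # stand-in for policy parameters theta
    history = []
    for t in range(n_outer):
        # Collect samples: one buffer D_z per skill z in Z
        D = {z: [] for z in skills}
        for _ in range(n_ep):
            z = random.choice(skills)   # sample skill z ~ p(z)
            episode = [(t, z)]          # placeholder for a rollout of pi_theta(., z)
            D[z].extend(episode)
        # Learn the SSR m_phi via on-policy C-learning -- stub omitted
        # Improve theta: n_sgd gradient steps on G(theta) + lambda_h * H(theta)
        for _ in range(n_sgd):
            theta += 1                  # stand-in for theta <- theta + alpha * grad
            # off-policy C-learning update of phi would go here
        history.append(theta)
    return history

print(leads_sketch())
```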
Open Source Code Yes We provide all code for LEADS and the baseline algorithms, as well as the scripts to reproduce the experiments (repository).
Open Datasets Yes We evaluate LEADS on a variety of MuJoCo [42] environments from different benchmark suites. Fetch-Reach [37] is a 7-DoF (degrees of freedom) robotic arm equipped with a two-fingered parallel gripper; its observation space is 10-dimensional. Fetch-Slide extends the former with a puck placed on a platform in front of the arm, increasing the observation space dimension to 25. Hand [37] is a 24-DoF anthropomorphic robotic hand, with a 63-dimensional observation space. Finger [44] is a 3-DoF manipulation environment with a 12-dimensional observation space, where a planar finger is required to rotate an object on an unactuated hinge.
Dataset Splits No The paper describes training on various reinforcement learning environments and evaluates performance across tasks, but does not specify explicit train/validation/test dataset splits in terms of percentages, counts, or predefined partition files.
Hardware Specification No This work was performed using HPC resources from CALMIP (Grant 2016-[p21001]).
Software Dependencies No The paper mentions software like MuJoCo [42] and the Gymnasium suite [43], but it does not provide specific version numbers for these or any other key software dependencies required to replicate the experiment.
Experiment Setup Yes The following table (Table 3) summarizes the hyperparameters used in our experimental setup.

Table 3: Hyperparameters used for LEADS
  Hyperparameter            Value
  n_skill                   6
  z_dim                     20
  λ_h                       0.05
  γ                         0.95
  λ_c-learning              0.5
  α_θ                       5 × 10⁻⁴
  α_c-learning              5 × 10⁻⁴
  n_episode                 16
  n_SGD, c-learning         256
  n_SGD, actor              16
  n_archive                 1
  batch size_c-learning     1024
  batch size_loss           1024
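For reference, the Table 3 hyperparameters can be collected into a single config mapping. The key names below are a transliteration chosen for this sketch (the authors' codebase may use different identifiers); the values are taken directly from the table.

```python
# Table 3 hyperparameters as a config dict (key names are assumptions).
LEADS_HPARAMS = {
    "n_skill": 6,                   # number of skills
    "z_dim": 20,                    # skill embedding dimension
    "lambda_h": 0.05,               # entropy-term weight in the objective
    "gamma": 0.95,                  # discount factor
    "lambda_c_learning": 0.5,       # C-learning loss weight
    "alpha_theta": 5e-4,            # actor learning rate
    "alpha_c_learning": 5e-4,       # C-learning learning rate
    "n_episode": 16,                # episodes collected per iteration
    "n_sgd_c_learning": 256,        # C-learning SGD steps
    "n_sgd_actor": 16,              # actor SGD steps
    "n_archive": 1,
    "batch_size_c_learning": 1024,
    "batch_size_loss": 1024,
}
print(LEADS_HPARAMS["gamma"])
```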