Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

DISCOVER: Automated Curricula for Sparse-Reward Reinforcement Learning

Authors: Leander Diaz-Bone, Marco Bagatella, Jonas Hübotter, Andreas Krause

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We then perform a thorough evaluation in high-dimensional environments. We find that the directed goal selection of DISCOVER solves exploration problems that are beyond the reach of prior state-of-the-art exploration methods in RL. We evaluate the empirical performance of DISCOVER across three complex, sparse-reward, long-horizon control tasks, highlighting five main insights. For all experiments, we report the mean performance across 10 seeds along with its standard error. Additional implementation details, hyperparameter choices and experimental results are reported in Appendices D, E and E.3, respectively. The code is available at https://github.com/LeanderDiazBone/discover.
Researcher Affiliation | Academia | (1) ETH Zürich, Switzerland; (2) Max Planck Institute for Intelligent Systems, Germany
Pseudocode | Yes | Algorithm 1: Goal-conditioned Reinforcement Learning
Open Source Code | Yes | The code is available at https://github.com/LeanderDiazBone/discover.
Open Datasets | No | Environments: For our experiments, we use the Jax GCRL library [9] to assess performance on challenging, high-dimensional navigation and manipulation tasks. Specifically, we evaluate on the antmaze environment, where a simulated quadruped with a 27-dimensional state space and an 8-dimensional action space must learn to navigate through a maze to reach a target location. We additionally implement the pointmaze environment (left), which allows for arbitrary dimensionality. No additional datasets were used for the experiments of this paper.
Dataset Splits | No | The paper utilizes reinforcement learning environments (antmaze, arm, pointmaze) for evaluation, not traditional static datasets that are typically divided into training, validation, and test splits. The 'simple' and 'hard' configurations refer to task difficulty levels within these environments, not predefined data splits.
Hardware Specification | Yes | All experiments were run on an internal cluster using a single NVIDIA GeForce RTX 2080 Ti GPU for each run. The experiments require at most 16 GB of memory.
Software Dependencies | No | The paper mentions using the 'TD3 actor-critic algorithm' and the 'Jax GCRL library [9]', but does not specify version numbers for these or any other key software components (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | Table 2: Hyperparameters for training in Jax GCRL Environments. Offline RL algorithm: TD3; Ensemble size: 6; Discount factor: 0.99; Batch size: 256; Learning rate: 3e-4; Policy update delay: 2; Target critic Polyak factor: 0.005; Relabel strategy: uniform future 70%, original 30%; Target critic computation: minimum of two random target critics; Size of critic ensemble: 6; Initial adaptation parameter α_0: 0; Horizon: 100-250; Parameter adaptation lookback k_adapt: 64-128.
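
The Table 2 settings and the seed-averaged reporting quoted above can be illustrated with a minimal Python sketch. It is not taken from the DISCOVER repository: the dictionary keys, function names, and sample numbers are hypothetical, and the functions encode one plausible reading of "minimum of two random target critics" and of the "uniform future 70%, original 30%" relabel strategy. Only the hyperparameter values themselves come from the table.

```python
import math
import random
import statistics

# Values copied from Table 2 as quoted above; the key names are hypothetical.
HYPERPARAMS = {
    "algorithm": "TD3",
    "ensemble_size": 6,
    "discount": 0.99,
    "batch_size": 256,
    "learning_rate": 3e-4,
    "policy_update_delay": 2,
    "polyak_factor": 0.005,
    "relabel_future_prob": 0.7,  # the remaining 0.3 keeps the original goal
    "alpha_0": 0.0,
}


def min_of_two_random_critics(q_values, rng):
    """Return the smaller estimate of two distinct, randomly chosen critics.

    Assumption: q_values holds one target-critic estimate per ensemble
    member for a single transition.
    """
    i, j = rng.sample(range(len(q_values)), 2)
    return min(q_values[i], q_values[j])


def relabel_goal(trajectory, t, original_goal, rng, p_future):
    """With probability p_future, replace the goal by a uniformly drawn
    state visited after step t; otherwise keep the original goal."""
    future_states = trajectory[t + 1:]
    if future_states and rng.random() < p_future:
        return rng.choice(future_states)
    return original_goal


def mean_and_stderr(per_seed_scores):
    """Mean and standard error of the mean across independent seeds."""
    n = len(per_seed_scores)
    mean = statistics.fmean(per_seed_scores)
    stderr = statistics.stdev(per_seed_scores) / math.sqrt(n)
    return mean, stderr


# Illustrative usage with made-up numbers (not results from the paper).
rng = random.Random(0)
q_ensemble = [1.2, 0.9, 1.5, 1.1, 0.8, 1.3]  # one estimate per critic
target_q = min_of_two_random_critics(q_ensemble, rng)
new_goal = relabel_goal(["s0", "s1", "s2", "s3"], 1, "g", rng,
                        HYPERPARAMS["relabel_future_prob"])
mean, stderr = mean_and_stderr([1.0, 2.0, 3.0, 4.0])
```

Taking the minimum over a random pair, rather than over all six critics, is a common way to limit overestimation without becoming overly pessimistic; whether the authors' implementation matches this sketch in detail should be checked against the linked repository.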