Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
DISCOVER: Automated Curricula for Sparse-Reward Reinforcement Learning
Authors: Leander Diaz-Bone, Marco Bagatella, Jonas Hübotter, Andreas Krause
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then perform a thorough evaluation in high-dimensional environments. We find that the directed goal selection of DISCOVER solves exploration problems that are beyond the reach of prior state-of-the-art exploration methods in RL. We evaluate the empirical performance of DISCOVER across three complex, sparse-reward, long-horizon control tasks, highlighting five main insights. For all experiments, we report the mean performance across 10 seeds along with its standard error. Additional implementation details, hyperparameter choices and experimental results are reported in Appendices D, E and E.3, respectively. The code is available at https://github.com/LeanderDiazBone/discover. |
| Researcher Affiliation | Academia | 1ETH Zürich, Switzerland 2Max Planck Institute for Intelligent Systems, Germany |
| Pseudocode | Yes | Algorithm 1 Goal-conditioned Reinforcement Learning |
| Open Source Code | Yes | The code is available at https://github.com/LeanderDiazBone/discover. |
| Open Datasets | No | Environments: For our experiments, we use the JaxGCRL library [9] to assess performance on challenging, high-dimensional navigation and manipulation tasks. Specifically, we evaluate on the antmaze environment, where a simulated quadruped with a 27-dimensional state space and an 8-dimensional action space must learn to navigate through a maze to reach a target location. We additionally implement the pointmaze environment (left), which allows for arbitrary dimensionality. No additional datasets were used for the experiments of this paper. |
| Dataset Splits | No | The paper utilizes reinforcement learning environments (antmaze, arm, pointmaze) for evaluation, not traditional static datasets that are typically divided into training, validation, and test splits. The 'simple' and 'hard' configurations refer to task difficulty levels within these environments, not predefined data splits. |
| Hardware Specification | Yes | All experiments were run on an internal cluster using a single NVIDIA GeForce RTX 2080 Ti GPU for each run. The experiments require at most 16GB of memory. |
| Software Dependencies | No | The paper mentions using the 'TD3 actor-critic algorithm' and the 'JaxGCRL library [9]', but does not specify version numbers for these or any other key software components (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Table 2: Hyperparameters for training in JaxGCRL environments. Offline RL algorithm: TD3; Ensemble size: 6; Discount factor: 0.99; Batch size: 256; Learning rate: 3e-4; Policy update delay: 2; Target critic Polyak factor: 0.005; Relabel strategy: uniform future 70%, original 30%; Target critic computation: minimum of two random target critics; Size of critic ensemble: 6; Initial adaptation parameter α0: 0; Horizon: 100-250; Parameter adaptation lookback kadapt: 64-128. |
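
To make the Experiment Setup and Research Type rows more concrete, the following is a minimal Python sketch, not the authors' implementation, of how three of the reported settings are commonly read: the Polyak factor of 0.005 as a soft target-network update, the "minimum of two random target critics" rule over an ensemble of 6, and the mean with standard error over seeds. All function names and the `HPARAMS` dictionary are hypothetical, and the update convention (the factor weights the online parameters) is an assumption based on standard TD3 practice rather than something stated on this page.

```python
import random
import statistics

# Values transcribed from the Experiment Setup row (Table 2 of the paper).
# The key names are illustrative, not taken from the authors' code.
HPARAMS = {
    "ensemble_size": 6,
    "discount": 0.99,
    "batch_size": 256,
    "learning_rate": 3e-4,
    "policy_update_delay": 2,
    "polyak_factor": 0.005,
}


def polyak_update(target, online, tau):
    """Soft update (assumed convention): target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


def min_of_two_random_critics(target_q_values, rng):
    """Sample two distinct ensemble members and return the smaller estimate."""
    i, j = rng.sample(range(len(target_q_values)), 2)
    return min(target_q_values[i], target_q_values[j])


def td_target(reward, done, target_q_values, discount, rng):
    """One-step TD target built on the two-critic minimum."""
    next_q = min_of_two_random_critics(target_q_values, rng)
    return reward + discount * (1.0 - done) * next_q


def mean_and_standard_error(per_seed_scores):
    """Mean and standard error of the mean across independent seeds."""
    n = len(per_seed_scores)
    mean = statistics.fmean(per_seed_scores)
    sem = statistics.stdev(per_seed_scores) / n ** 0.5
    return mean, sem


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder Q-values for an ensemble of 6 target critics.
    ensemble_q = [float(k) for k in range(1, HPARAMS["ensemble_size"] + 1)]
    print(td_target(0.0, 0.0, ensemble_q, HPARAMS["discount"], rng))
    print(polyak_update([0.0, 1.0], [1.0, 1.0], HPARAMS["polyak_factor"]))
```

In practice these operations would run on batched arrays (for example in JAX); plain Python lists are used here only to keep the sketch dependency-free.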