Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

GoalLadder: Incremental Goal Discovery with Vision-Language Models

Authors: Alexey Zakharov, Shimon Whiteson

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate that GoalLadder outperforms existing related methods on classic control and robotic manipulation environments with the average final success rate of 95% compared to only 45% of the best competitor. We run a variety of continuous-control experiments to investigate GoalLadder's performance.
Researcher Affiliation	Academia	Alexey Zakharov, University of Oxford; Shimon Whiteson, University of Oxford
Pseudocode	Yes	Algorithm 1: GOALLADDER: Pseudo algorithm
Open Source Code	Yes	Answer: [Yes] Justification: The code will be provided.
Open Datasets	Yes	In particular, we use two classic control environments (CartPole, MountainCar) from OpenAI Gym [33] and five robotic manipulation environments (Drawer Close, Drawer Open, Sweep Into, Window Open, Button Press) from the Metaworld suite [34].
Dataset Splits	No	The paper does not explicitly mention dataset splits in terms of percentages or counts for training, validation, and test sets. It describes an online reinforcement learning setup where data is collected and episodes are run, rather than using pre-defined splits of a static dataset.
Hardware Specification	Yes	For training GoalLadder, we use Tesla V100 16GB GPU and 2 Intel Xeon E5-2698 v4 CPUs, each with 20 cores and 2 hardware threads per core.
Software Dependencies	No	The paper mentions using 'Gemini 2.0 Flash' as a VLM backbone, 'Soft-Actor Critic (SAC)' as the RL backbone, and 'Adam optimiser' for the feature extractor. However, it does not provide specific version numbers for any of these software components or other libraries like Python or PyTorch.
Experiment Setup	Yes	The top-rated candidate goal is selected as the new target every L = 5000 environment steps. The goal buffer size is capped at |Bg| = 10. VLM queries are performed every K environment steps with M queries per feedback session (see Section 4). SAC gradient update is performed after every environment step. Finally, the VLM prompts are standardised using a single template to ensure consistency and ease of use. Algorithm 1 shows the stepwise, high-level operations of GoalLadder. See Appendix A for further implementation details, including hyperparameters, architectural details, and used computational resources. We use feedback rates of K = 2000 with M = 5 for OpenAI Gym environments and K = 500 with M = 5 for the Metaworld environments. Appendix A.2: The dimensionality of latent states |z| was chosen to be 16... The feature extractor is trained using Adam optimiser [36] with a learning rate of 0.0001. The batch size is 128... We set the beta parameter β = 0.1... and use MSE as reconstruction loss. Appendix A.3.1: We also set the parameters of ELO rating updates: C = 400 ... and T = 32. Appendix A.3.2: r̂ = rγ, where γ = 20.
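
For readers unfamiliar with ELO-style ratings, the sketch below shows how a pairwise comparison between two candidate goals could update their ratings. It is a minimal illustration of the textbook Elo rule, not the paper's implementation: it assumes that C = 400 plays the role of the rating scale and T = 32 the role of the update step size, which the excerpt above does not confirm. The function names `expected_score` and `elo_update` are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the standard Elo model.

    `scale` is assumed to correspond to C = 400 in the excerpt.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(
    rating_a: float,
    rating_b: float,
    score_a: float,
    scale: float = 400.0,
    step: float = 32.0,
) -> tuple[float, float]:
    """Return updated ratings after one comparison.

    `score_a` is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    `step` is assumed to correspond to T = 32 in the excerpt.
    """
    exp_a = expected_score(rating_a, rating_b, scale)
    exp_b = 1.0 - exp_a
    new_a = rating_a + step * (score_a - exp_a)
    new_b = rating_b + step * ((1.0 - score_a) - exp_b)
    return new_a, new_b


# Example: two goals start at the same rating and the first is preferred.
updated_a, updated_b = elo_update(1000.0, 1000.0, score_a=1.0)
```

Under these assumptions the update is zero-sum: whatever rating one goal gains, the other loses, so the mean rating of the goal buffer stays constant.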