Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Efficient Off-Policy Learning for High-Dimensional Action Spaces

Authors: Fabian Otto, Philipp Becker, Vien A. Ngo, Gerhard Neumann

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments demonstrate the relevance and importance of these design choices in achieving high performance. Notably, by removing the dependence on the Q-function, our method is particularly well suited for environments with complex action spaces, such as the challenging MyoSuite (Vittorio et al., 2022) and dog locomotion tasks in DeepMind Control (DMC) (Tassa et al., 2018), which most standard off-policy actor-critic methods cannot solve. For our experiments, we evaluate Vlearn on various high-dimensional continuous control tasks from Gymnasium (Towers et al., 2023), DMC (Tunyasuvunakool et al., 2020), and MyoSuite (Vittorio et al., 2022)."
Researcher Affiliation | Collaboration | Fabian Otto (Microsoft Research), Philipp Becker (Karlsruhe Institute of Technology), Ngo Anh Vien (Bosch Center for Artificial Intelligence), Gerhard Neumann (Karlsruhe Institute of Technology)
Pseudocode | Yes | "Pseudo-code is given in Appendix B." (Algorithm 1: Pseudo-code for Vlearn)
Open Source Code | No | "The pseudo-code of our method, implementation details, and hyperparameters used in our experiments can be found in the Appendix. Readers can replicate our results by following the procedures outlined in these sections."
Open Datasets | Yes | "For our experiments, we evaluate Vlearn on various high-dimensional continuous control tasks from Gymnasium (Towers et al., 2023), DMC (Tunyasuvunakool et al., 2020), and MyoSuite (Vittorio et al., 2022)."
Dataset Splits | No | "To train this estimator, they rely on a dataset D = {(s_t, a_t, r_t, s_{t+1})}_{t=1...N} and a behavior policy π_b(·|s) responsible for generating this dataset. Typically, D takes the form of a replay buffer (Lin, 1992), with the corresponding behavior policy π_b being a mixture of the historical policies used to populate the buffer."
Hardware Specification | Yes | "All models were trained on an internal cluster on one Nvidia V100 for approximately 1-3 days, depending on the task."
Software Dependencies | No | "We conducted a random grid search to tune all hyperparameters for both Gymnasium (Towers et al., 2023) and DMC (Tunyasuvunakool et al., 2020), which are based on MuJoCo (Todorov et al., 2012)."
Experiment Setup | Yes | "Detailed hyperparameter information for all methods can be found in Appendix D. We conducted a random grid search to tune all hyperparameters for both Gymnasium (Towers et al., 2023) and DMC (Tunyasuvunakool et al., 2020), which are based on MuJoCo (Todorov et al., 2012)."
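The "Dataset Splits" excerpt above describes a replay-buffer setup rather than fixed train/validation/test splits: transitions are stored as the agent acts, and the implicit behavior policy is the mixture of all historical policies that filled the buffer. A minimal sketch of that data structure (an illustration of the general concept, not the paper's implementation; the class and method names are hypothetical):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next) transitions.

    Because transitions accumulate over training, the data is implicitly
    generated by a mixture of all historical policies that pushed into
    the buffer, as described in the excerpt above.
    """

    def __init__(self, capacity):
        # deque with maxlen evicts the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement over stored transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Off-policy methods like the one reported here train on batches drawn from such a buffer, which is why no conventional dataset split is reported.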
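The "random grid search" mentioned under Experiment Setup amounts to sampling configurations uniformly from a discrete grid and keeping the best-scoring one. A short sketch under that reading (the function and the grid values below are illustrative assumptions, not the paper's actual search space or tuning code):

```python
import random

def random_grid_search(grid, n_trials, evaluate, seed=0):
    """Sample n_trials configurations uniformly from a discrete grid
    and return the best according to `evaluate` (higher is better)."""
    rng = random.Random(seed)  # seeded for reproducibility
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        # One independent draw per hyperparameter
        cfg = {name: rng.choice(values) for name, values in grid.items()}
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical grid for a continuous-control experiment:
grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "batch_size": [64, 128, 256],
}
```

Unlike exhaustive grid search, this evaluates only `n_trials` random points, which keeps tuning affordable when each trial is a full training run (1-3 days on a V100, per the hardware row).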