Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Efficient Off-Policy Learning for High-Dimensional Action Spaces

Authors: Fabian Otto, Philipp Becker, Vien A. Ngo, Gerhard Neumann

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments demonstrate the relevance and importance of these design choices in achieving high performance. Notably, by removing the dependence on the Q-function, our method is particularly well suited for environments with complex action spaces, such as the challenging MyoSuite (Vittorio et al., 2022) and dog locomotion tasks in DeepMind Control (DMC) (Tassa et al., 2018), which most standard off-policy actor-critic methods cannot solve. For our experiments, we evaluate Vlearn on various high-dimensional continuous control tasks from Gymnasium (Towers et al., 2023), DMC (Tunyasuvunakool et al., 2020), and MyoSuite (Vittorio et al., 2022)."
Researcher Affiliation | Collaboration | Fabian Otto (Microsoft Research), Philipp Becker (Karlsruhe Institute of Technology), Ngo Anh Vien (Bosch Center for Artificial Intelligence), Gerhard Neumann (Karlsruhe Institute of Technology)
Pseudocode | Yes | "Pseudo-code is given in Appendix B." (Algorithm 1: Pseudo-code for Vlearn)
Open Source Code | No | "The pseudo-code of our method, implementation details, and hyperparameters used in our experiments can be found in the Appendix. Readers can replicate our results by following the procedures outlined in these sections."
Open Datasets | Yes | "For our experiments, we evaluate Vlearn on various high-dimensional continuous control tasks from Gymnasium (Towers et al., 2023), DMC (Tunyasuvunakool et al., 2020), and MyoSuite (Vittorio et al., 2022)."
Dataset Splits | No | "To train this estimator, they rely on a dataset D = {(s_t, a_t, r_t, s_{t+1})}_{t=1...N} and a behavior policy π_b(·|s) responsible for generating this dataset. Typically, D takes the form of a replay buffer (Lin, 1992), with the corresponding behavior policy π_b being a mixture of the historical policies used to populate the buffer."
Hardware Specification | Yes | "All models were trained on an internal cluster on one Nvidia V100 for approximately 1-3 days, depending on the task."
Software Dependencies | No | "We conducted a random grid search to tune all hyperparameters for both Gymnasium (Towers et al., 2023) and DMC (Tunyasuvunakool et al., 2020), which are based on MuJoCo (Todorov et al., 2012)."
Experiment Setup | Yes | "Detailed hyperparameter information for all methods can be found in Appendix D. We conducted a random grid search to tune all hyperparameters for both Gymnasium (Towers et al., 2023) and DMC (Tunyasuvunakool et al., 2020), which are based on MuJoCo (Todorov et al., 2012)."
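The "Dataset Splits" excerpt above describes a replay-buffer setup rather than fixed train/validation/test splits: transitions are stored as the agent acts, and the implicit behavior policy is the mixture of all historical policies that filled the buffer. A minimal sketch of that data structure (an illustration of the general concept, not the paper's implementation; the class and method names are hypothetical):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next) transitions.

    Because transitions accumulate over training, the data is implicitly
    generated by a mixture of all historical policies that pushed into
    the buffer, as described in the excerpt above.
    """

    def __init__(self, capacity):
        # deque with maxlen evicts the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement over stored transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Off-policy methods like the one reported here train on batches drawn from such a buffer, which is why no conventional dataset split is reported.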
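The "random grid search" mentioned under Experiment Setup amounts to sampling configurations uniformly from a discrete grid and keeping the best-scoring one. A short sketch under that reading (the function and the grid values below are illustrative assumptions, not the paper's actual search space or tuning code):

```python
import random

def random_grid_search(grid, n_trials, evaluate, seed=0):
    """Sample n_trials configurations uniformly from a discrete grid
    and return the best according to `evaluate` (higher is better)."""
    rng = random.Random(seed)  # seeded for reproducibility
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        # One independent draw per hyperparameter
        cfg = {name: rng.choice(values) for name, values in grid.items()}
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical grid for a continuous-control experiment:
grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "batch_size": [64, 128, 256],
}
```

Unlike exhaustive grid search, this evaluates only `n_trials` random points, which keeps tuning affordable when each trial is a full training run (1-3 days on a V100, per the hardware row).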