Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Efficient Off-Policy Learning for High-Dimensional Action Spaces
Authors: Fabian Otto, Philipp Becker, Vien A Ngo, Gerhard Neumann
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate the relevance and importance of these design choices in achieving high performance. Notably, by removing the dependence on the Q-function our method is particularly well suited for environments with complex action spaces, such as the challenging Myo Suite (Vittorio et al., 2022) and dog locomotion tasks in Deep Mind control (DMC) (Tassa et al., 2018), which most standard off-policy actor-critic methods cannot solve. For our experiments, we evaluate Vlearn on various high-dimensional continuous control tasks from Gymnasium (Towers et al., 2023), DMC (Tunyasuvunakool et al., 2020) and Myo Suite (Vittorio et al., 2022). |
| Researcher Affiliation | Collaboration | Fabian Otto Microsoft Research EMAIL Philipp Becker Karlsruhe Institute of Technology Ngo Anh Vien Bosch Center for Artificial Intelligence Gerhard Neumann Karlsruhe Institute of Technology |
| Pseudocode | Yes | Pseudo-Code is given in Appendix B. Algorithm 1 Pseudo Code for Vlearn |
| Open Source Code | No | The pseudo-code of our method, implementation details, and hyperparameters used in our experiments can be found in the Appendix. Readers can replicate our results by following the procedures outlined in these sections. |
| Open Datasets | Yes | For our experiments, we evaluate Vlearn on various high-dimensional continuous control tasks from Gymnasium (Towers et al., 2023), DMC (Tunyasuvunakool et al., 2020) and Myo Suite (Vittorio et al., 2022). |
| Dataset Splits | No | To train this estimator, they rely on a dataset D = {(s_t, a_t, r_t, s_{t+1})}_{t=1...N} and a behavioral policy π_b(a\|s) responsible for generating this dataset. Typically, D takes the form of a replay buffer (Lin, 1992), with the corresponding behavior policy π_b being a mixture of the historical policies used to populate the buffer. |
| Hardware Specification | Yes | All models were trained on an internal cluster on one Nvidia V100 for approximately 1-3 days, depending on the task. |
| Software Dependencies | No | We conducted a random grid search to tune all hyperparameters for both Gymnasium (Towers et al., 2023) and DMC (Tunyasuvunakool et al., 2020), which are based on Mujoco (Todorov et al., 2012). |
| Experiment Setup | Yes | Detailed hyperparameter information for all methods can be found in Appendix D. We conducted a random grid search to tune all hyperparameters for both Gymnasium (Towers et al., 2023) and DMC (Tunyasuvunakool et al., 2020), which are based on Mujoco (Todorov et al., 2012). |
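The Dataset Splits row quotes the paper's description of the dataset D as a replay buffer holding (s_t, a_t, r_t, s_{t+1}) transitions, with the behavior policy implicitly being a mixture of the historical policies that filled it. As an illustration only (the class name, capacity, and uniform sampling are assumptions, not taken from the Vlearn implementation), such a buffer can be sketched as:

```python
import random
from collections import deque


class ReplayBuffer:
    """Minimal FIFO replay buffer of (s, a, r, s') transitions.

    Illustrative sketch only; not the authors' code. The implied
    behavior policy pi_b is the mixture of all past policies whose
    transitions remain in the buffer.
    """

    def __init__(self, capacity=100_000):
        # deque with maxlen evicts the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling over stored transitions, i.e. sampling
        # from the mixture of historical policies.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Because the deque is bounded, old transitions are dropped first, so the effective behavior-policy mixture drifts toward more recent policies as training progresses.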