Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Robust Reinforcement Learning in a Sample-Efficient Setting
Authors: Siemen Herremans, Ali Anwar, Siegfried Mercelis
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results indicate a notable improvement in policy robustness on high-dimensional control tasks, with the auxiliary model enhancing the performance of the learned policy in distorted MDPs, while maintaining the data-efficiency of the base algorithm. Our methodology is also compared against various other robust RL approaches. We further examine how pessimism is achieved by exploring the learned deviation between the proposed auxiliary world model and the nominal model. |
| Researcher Affiliation | Academia | Siemen Herremans, Ali Anwar, and Siegfried Mercelis: IDLab, Department of Electronics and ICT, Faculty of Applied Engineering, University of Antwerp, imec |
| Pseudocode | Yes | Algorithm 1 RMBPO (Additions in blue) Algorithm 2 Supervised Pessimistic Distribution Learning with an Auxiliary Model |
| Open Source Code | No | Evaluation code and weights available at https://github.com/rmbpo-eval/rmbpo-tmlr ...The authors are not able to release the full source code of RMBPO at the time of submission of this paper, however, the reader is encouraged to contact the first author of this work with any related questions. |
| Open Datasets | Yes | Secondly (ii), we evaluate the empirical performance of our algorithm on high-dimensional MuJoCo and DeepMind Control Suite (DMC) benchmarks under simultaneous parameter distortions |
| Dataset Splits | No | The paper describes an agent interacting with environments (MuJoCo and DMC), so the data are generated dynamically through these interactions. Fixed training/validation/test splits, as defined in supervised learning for static datasets, are not applicable in this setting; evaluation is instead performed on distorted versions of the training environment. |
| Hardware Specification | Yes | Experiments were run on an Ubuntu 20.04 (Docker) machine with a single NVIDIA Quadro RTX 4000 GPU, two CPU cores, and 38 GB of memory. |
| Software Dependencies | No | The paper mentions "Ubuntu 20.04 (Docker)" as the operating system and containerization platform, but does not specify versions for the critical software libraries or frameworks (e.g., Python, JAX/PyTorch/TensorFlow) needed to replicate the experiment environment beyond the OS. |
| Experiment Setup | Yes | Table 2: Hyperparameters. ρ: 4 (Hopper-v4), 0.5 (Walker2d-v4), 0.25 / 0.5 (HalfCheetah-v4), 0.25 (DMC Walker); ν_a: 1e-4 (all environments); total environment steps: 125k (Hopper-v4), 300k (Walker2d-v4), 400k (HalfCheetah-v4), 200k (DMC Walker) |
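The hyperparameter excerpt above (Table 2 of the paper) can be sketched as a per-environment config, which is one common way to organize such settings for a reproduction attempt. This is only an illustrative transcription: the key names (`rho`, `nu_a`, `total_env_steps`) are hypothetical, and the mapping of ρ values to environments assumes the column order shown in the excerpt.

```python
# Hypothetical transcription of the paper's Table 2 hyperparameters.
# Key names are illustrative; values follow the excerpt's column order
# (Hopper-v4, Walker2d-v4, HalfCheetah-v4, DMC Walker).
HYPERPARAMS = {
    "Hopper-v4":      {"rho": 4.0,          "nu_a": 1e-4, "total_env_steps": 125_000},
    "Walker2d-v4":    {"rho": 0.5,          "nu_a": 1e-4, "total_env_steps": 300_000},
    "HalfCheetah-v4": {"rho": (0.25, 0.5),  "nu_a": 1e-4, "total_env_steps": 400_000},
    "DMC Walker":     {"rho": 0.25,         "nu_a": 1e-4, "total_env_steps": 200_000},
}

def get_config(env_id: str) -> dict:
    """Look up the hyperparameter set for one benchmark environment."""
    return HYPERPARAMS[env_id]
```

A structure like this makes it easy to verify, per environment, which settings a rerun used; note that ν_a is identical (1e-4) across all four benchmarks in the excerpt.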