Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Robust Reinforcement Learning in a Sample-Efficient Setting
Authors: Siemen Herremans, Ali Anwar, Siegfried Mercelis
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results indicate a notable improvement in policy robustness on high-dimensional control tasks, with the auxiliary model enhancing the performance of the learned policy in distorted MDPs, while maintaining the data-efficiency of the base algorithm. Our methodology is also compared against various other robust RL approaches. We further examine how pessimism is achieved by exploring the learned deviation between the proposed auxiliary world model and the nominal model. |
| Researcher Affiliation | Academia | Siemen Herremans, Ali Anwar, and Siegfried Mercelis: IDLab, Department of Electronics and ICT, Faculty of Applied Engineering, University of Antwerp, imec |
| Pseudocode | Yes | Algorithm 1 RMBPO (Additions in blue) Algorithm 2 Supervised Pessimistic Distribution Learning with an Auxiliary Model |
| Open Source Code | No | Evaluation code and weights available at https://github.com/rmbpo-eval/rmbpo-tmlr ...The authors are not able to release the full source code of RMBPO at the time of submission of this paper, however, the reader is encouraged to contact the first author of this work with any related questions. |
| Open Datasets | Yes | Secondly (ii), we evaluate the empirical performance of our algorithm on high-dimensional MuJoCo and DeepMind Control Suite (DMC) benchmarks under simultaneous parameter distortions |
| Dataset Splits | No | The paper describes an agent interacting with environments (MuJoCo and DMC), so the data are generated dynamically through these interactions. Fixed training/validation/test splits, as defined in supervised learning for static datasets, are not applicable in this setting; evaluation is instead performed on distorted versions of the training environment. |
| Hardware Specification | Yes | Experiments were run on an Ubuntu 20.04 (Docker) machine with a single NVIDIA Quadro RTX 4000 GPU, two CPU cores, and 38 GB of memory. |
| Software Dependencies | No | The paper mentions "Ubuntu 20.04 (Docker)" as the operating system and containerization platform, but does not specify versions for the critical software libraries or frameworks (e.g., Python, JAX/PyTorch/TensorFlow) needed to replicate the experiment environment beyond the OS. |
| Experiment Setup | Yes | Table 2: Hyperparameters. ρ: 4 (Hopper-v4), 0.5 (Walker2d-v4), 0.25 / 0.5 (HalfCheetah-v4), 0.25 (DMC Walker); ν_a: 1e-4 (all environments); total environment steps: 125k (Hopper-v4), 300k (Walker2d-v4), 400k (HalfCheetah-v4), 200k (DMC Walker) |
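The hyperparameter excerpt above (Table 2 of the paper) can be sketched as a per-environment config, which is one common way to organize such settings for a reproduction attempt. This is only an illustrative transcription: the key names (`rho`, `nu_a`, `total_env_steps`) are hypothetical, and the mapping of ρ values to environments assumes the column order shown in the excerpt.

```python
# Hypothetical transcription of the paper's Table 2 hyperparameters.
# Key names are illustrative; values follow the excerpt's column order
# (Hopper-v4, Walker2d-v4, HalfCheetah-v4, DMC Walker).
HYPERPARAMS = {
    "Hopper-v4":      {"rho": 4.0,          "nu_a": 1e-4, "total_env_steps": 125_000},
    "Walker2d-v4":    {"rho": 0.5,          "nu_a": 1e-4, "total_env_steps": 300_000},
    "HalfCheetah-v4": {"rho": (0.25, 0.5),  "nu_a": 1e-4, "total_env_steps": 400_000},
    "DMC Walker":     {"rho": 0.25,         "nu_a": 1e-4, "total_env_steps": 200_000},
}

def get_config(env_id: str) -> dict:
    """Look up the hyperparameter set for one benchmark environment."""
    return HYPERPARAMS[env_id]
```

A structure like this makes it easy to verify, per environment, which settings a rerun used; note that ν_a is identical (1e-4) across all four benchmarks in the excerpt.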