Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Robust Transfer of Safety-Constrained Reinforcement Learning Agents
Authors: Markel Zubia, Thiago Simão, Nils Jansen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The empirical evaluation shows that this method yields policies that are robust against changes in dynamics, demonstrating safety after transfer to a new environment. |
| Researcher Affiliation | Academia | ¹Ruhr University Bochum, Germany; ²Eindhoven University of Technology, The Netherlands; ³Radboud University Nijmegen, The Netherlands |
| Pseudocode | No | The paper describes the methodology in Section 5 ('ROBUST GUIDED SAFE EXPLORATION') using natural language without presenting any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code is available on https://github.com/ai-fm/safe-and-robust-transfer |
| Open Datasets | Yes | We evaluate our method on benchmark environments created using a framework for safe reinforcement learning called Safety-Gymnasium (Ji et al., 2023). |
| Dataset Splits | Yes | We restrict the uncertainty set to a finite subset (U) by discretizing the values of the parameters to $m = m_1, \ldots, m_N$ and $\eta = \eta_1, \ldots, \eta_N$. In our experiments, we use $N = 8$ values for each parameter by letting $m_i = \left(0.5 + \frac{i-1}{7}\right) m$ and $\eta_i = \left(0.5 + \frac{i-1}{7}\right) \eta$ for $i = 1, \ldots, 8$, where $m$ and $\eta$ correspond to the dynamics in the source task. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions 'Safety-Gymnasium' as a framework for benchmark environments but does not provide specific version numbers for it or any other software libraries or dependencies used. |
| Experiment Setup | Yes | Appendix A, Hyperparameters: The hyperparameters in our method are summarized in Table 1. All actor and critic networks are modeled by a multilayer perceptron (MLP). Table 1 (values listed as M1 / M2 / M3): actor network size [256, 256] / [256, 256] / [256, 256]; critic network size [256, 256] / [256, 256] / [256, 256]; size of replay buffer $10^6$ / $10^6$ / $10^6$; batch size 256 / 256 / 256; steps per epoch 2000 / 2000 / 2000; number of epochs 106 / 106 / 106; actor learning rate $5 \times 10^{-6}$ / $5 \times 10^{-6}$ / $5 \times 10^{-6}$; critic learning rate $10^{-3}$ / $10^{-3}$ / $10^{-3}$; lambda learning rate $5 \times 10^{-7}$ / $5 \times 10^{-7}$ / $5 \times 10^{-7}$; safety constraint 5 / 8 / 25. Table 1: The hyperparameters used in the experiments. |
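
The discretization quoted in the Dataset Splits row can be illustrated with a short sketch. It is not taken from the paper's repository: the function name `discretize`, the nominal values of 1.0, and the choice to combine the two parameter grids as a Cartesian product are all assumptions made for illustration. Only the scaling rule, (0.5 + (i - 1)/7) times the source-task value for i = 1, ..., 8, comes from the quoted text.

```python
from itertools import product


def discretize(nominal: float, n: int = 8) -> list[float]:
    """Scale a nominal value by (0.5 + (i - 1) / (n - 1)) for i = 1..n.

    With n = 8 this matches the quoted rule (0.5 + (i - 1) / 7) * nominal,
    giving evenly spaced values from 0.5x to 1.5x the nominal value.
    """
    return [(0.5 + (i - 1) / (n - 1)) * nominal for i in range(1, n + 1)]


# Hypothetical source-task values; the real ones depend on the environment.
m_source = 1.0
eta_source = 1.0

m_values = discretize(m_source)
eta_values = discretize(eta_source)

# Assumption: the finite uncertainty set pairs every m_i with every eta_j.
uncertainty_set = list(product(m_values, eta_values))

print(f"m grid:   {[round(v, 3) for v in m_values]}")
print(f"eta grid: {[round(v, 3) for v in eta_values]}")
print(f"number of (m, eta) pairs: {len(uncertainty_set)}")
```

Under the product assumption, eight values per parameter give 64 candidate dynamics, each between half and one and a half times the source-task value.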