Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Embedding Safety into RL: A New Take on Trust Region Methods
Authors: Nikola Milosevic, Johannes Müller, Nico Scherf
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that C-TRPO reduces constraint violations while maintaining competitive returns. ... We benchmark C-TRPO against 9 common safe policy optimization algorithms ... on 8 tasks (4 Navigation and 4 Locomotion) from the Safety Gymnasium ... benchmark. |
| Researcher Affiliation | Academia | 1Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig 2Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Dresden/Leipzig 3Institut für Mathematik, Technische Universität Berlin, 10623 Berlin, Germany. Correspondence to: Nikola Milosevic <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Constrained TRPO (C-TRPO); differences from TRPO in blue |
| Open Source Code | Yes | 1Code: https://github.com/milosen/ctrpo |
| Open Datasets | Yes | We benchmark C-TRPO against 9 common safe policy optimization algorithms ... on 8 tasks (4 Navigation and 4 Locomotion) from the Safety Gymnasium (Ji et al., 2023) benchmark. |
| Dataset Splits | No | The paper states, "Each algorithm is evaluated by training for 10 million environment steps with 5 seeds each," which describes the evaluation protocol. However, it does not provide training/test/validation splits for any underlying dataset. The Safety Gymnasium benchmark supplies the environments, but the paper does not describe any data split within them that would need to be reproduced. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions or other libraries). |
| Experiment Setup | Yes | Each algorithm is evaluated by training for 10 million environment steps with 5 seeds each, and the cost regret is monitored throughout training for every run. ... For the C-TRPO implementation we fix the convex generator φ(x) = x log(x), motivated by its superior performance in our experiments, see Appendix B.2.1, and b̂ = 0.8b and β = 1 across all experiments. |
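To make the quoted setup concrete: φ(x) = x log(x) is a convex generator in the standard φ-divergence sense, D_φ(p‖q) = Σᵢ qᵢ φ(pᵢ/qᵢ), and for this particular φ the induced divergence is the KL divergence KL(p‖q). The sketch below (plain Python, function names hypothetical; this illustrates the generic φ-divergence construction, not the paper's C-TRPO implementation) checks that identity on a small discrete distribution.

```python
import math

def phi(x):
    # Convex generator phi(x) = x * log(x), with phi(0) = 0 by continuity.
    return 0.0 if x == 0 else x * math.log(x)

def phi_divergence(p, q):
    """Discrete phi-divergence D_phi(p || q) = sum_i q_i * phi(p_i / q_i).

    For phi(x) = x log(x) this reduces to the KL divergence KL(p || q).
    Assumes q_i > 0 wherever p_i > 0 (absolute continuity).
    """
    return sum(qi * phi(pi / qi) for pi, qi in zip(p, q) if qi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
print(phi_divergence(p, q), kl)  # the two values agree
```

A different choice of generator (e.g. φ(x) = (x − 1)², giving the χ²-divergence) plugs into the same `phi_divergence` skeleton, which is why the paper can compare generators as in its Appendix B.2.1.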