Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Embedding Safety into RL: A New Take on Trust Region Methods
Authors: Nikola Milosevic, Johannes Müller, Nico Scherf
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that C-TRPO reduces constraint violations while maintaining competitive returns. ... We benchmark C-TRPO against 9 common safe policy optimization algorithms ... on 8 tasks (4 Navigation and 4 Locomotion) from the Safety Gymnasium ... benchmark. |
| Researcher Affiliation | Academia | (1) Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig; (2) Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Dresden/Leipzig; (3) Institut für Mathematik, Technische Universität Berlin, 10623 Berlin, Germany. Correspondence to: Nikola Milosevic <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Constrained TRPO (C-TRPO); differences from TRPO in blue |
| Open Source Code | Yes | Code: https://github.com/milosen/ctrpo |
| Open Datasets | Yes | We benchmark C-TRPO against 9 common safe policy optimization algorithms ... on 8 tasks (4 Navigation and 4 Locomotion) from the Safety Gymnasium (Ji et al., 2023) benchmark. |
| Dataset Splits | No | The paper states, "Each algorithm is evaluated by training for 10 million environment steps with 5 seeds each," which describes the evaluation protocol. However, it does not provide specific training/test/validation splits for any underlying datasets within these environments. The Safety Gymnasium benchmark provides environments, but the paper does not detail how data within those environments is split for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions or other libraries). |
| Experiment Setup | Yes | Each algorithm is evaluated by training for 10 million environment steps with 5 seeds each, and the cost regret is monitored throughout training for every run. ... For the C-TRPO implementation we fix the convex generator φ(x) = x log(x), motivated by its superior performance in our experiments, see Appendix B.2.1, and b_H = 0.8b and β = 1 across all experiments. |
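
For readers who want to record the quoted experiment setup in one place, the following is a minimal Python sketch. It is not taken from the authors' repository: the class name `CTRPOSetup`, its field names, and the helper `phi` are hypothetical, and reading `b` as the cost threshold of the constrained task is an assumption. Only the numeric values (10 million steps, 5 seeds, β = 1, b_H = 0.8b) and the generator φ(x) = x log(x) come from the table above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CTRPOSetup:
    """Settings quoted in the Experiment Setup row (field names are hypothetical)."""

    total_env_steps: int = 10_000_000  # "10 million environment steps"
    num_seeds: int = 5                 # "5 seeds each"
    beta: float = 1.0                  # beta = 1 across all experiments
    b_h_ratio: float = 0.8             # b_H = 0.8 * b

    def b_h(self, b: float) -> float:
        """Return b_H for a given b (assumed here to be the cost threshold)."""
        return self.b_h_ratio * b


def phi(x: float) -> float:
    """Convex generator phi(x) = x * log(x), evaluated for x > 0 in this sketch."""
    if x <= 0:
        raise ValueError("phi is only evaluated for x > 0 in this sketch")
    return x * math.log(x)


setup = CTRPOSetup()
example_b = 25.0  # illustrative value only, not taken from the paper
example_b_h = setup.b_h(example_b)
```

The frozen dataclass keeps the values immutable, which mirrors the statement that they are fixed across all experiments; anything beyond these four values (network sizes, learning rates, hardware) is not given in the table and is deliberately left out.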