Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Embedding Safety into RL: A New Take on Trust Region Methods

Authors: Nikola Milosevic, Johannes Müller, Nico Scherf

ICML 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Experiments show that C-TRPO reduces constraint violations while maintaining competitive returns. ... We benchmark C-TRPO against 9 common safe policy optimization algorithms ... on 8 tasks (4 Navigation and 4 Locomotion) from the Safety Gymnasium ... benchmark." |
| Researcher Affiliation | Academia | "1 Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig; 2 Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Dresden/Leipzig; 3 Institut für Mathematik, Technische Universität Berlin, 10623 Berlin, Germany. Correspondence to: Nikola Milosevic <EMAIL>." |
| Pseudocode | Yes | "Algorithm 1 Constrained TRPO (C-TRPO); differences from TRPO in blue" |
| Open Source Code | Yes | "Code: https://github.com/milosen/ctrpo" |
| Open Datasets | Yes | "We benchmark C-TRPO against 9 common safe policy optimization algorithms ... on 8 tasks (4 Navigation and 4 Locomotion) from the Safety Gymnasium (Ji et al., 2023) benchmark." |
| Dataset Splits | No | The paper states, "Each algorithm is evaluated by training for 10 million environment steps with 5 seeds each," which describes the evaluation protocol but does not specify training/validation/test splits. The Safety Gymnasium benchmark provides environments, and the paper does not detail how data within those environments is partitioned for reproduction. |
| Hardware Specification | No | The paper does not report hardware details such as GPU/CPU models, processor types, or memory amounts used for the experiments. |
| Software Dependencies | No | The paper does not list software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or other libraries). |
| Experiment Setup | Yes | "Each algorithm is evaluated by training for 10 million environment steps with 5 seeds each, and the cost regret is monitored throughout training for every run. ... For the C-TRPO implementation we fix the convex generator φ(x) = x log(x), motivated by its superior performance in our experiments, see Appendix B.2.1, and b̂ = 0.8b and β = 1 across all experiments." |
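To illustrate the convex generator φ(x) = x log(x) quoted above: such a generator induces a φ-divergence D_φ(p ‖ q) = Σ_i q_i φ(p_i / q_i), and for this particular choice the divergence reduces to the familiar KL divergence used in standard TRPO trust regions. The sketch below is only a numerical illustration of that identity (function names are our own, not the authors' implementation):

```python
import math

def phi(x: float) -> float:
    """Convex generator phi(x) = x * log(x), with phi(0) = 0 by convention."""
    return 0.0 if x == 0.0 else x * math.log(x)

def phi_divergence(p: list[float], q: list[float]) -> float:
    """phi-divergence D_phi(p || q) = sum_i q_i * phi(p_i / q_i).

    For phi(x) = x log(x) this equals the KL divergence KL(p || q).
    Assumes q_i > 0 wherever p_i > 0.
    """
    return sum(qi * phi(pi / qi) for pi, qi in zip(p, q) if qi > 0)

p = [0.7, 0.3]
q = [0.5, 0.5]
d = phi_divergence(p, q)
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
print(d, kl)  # the two values coincide
```

This identity is why φ(x) = x log(x) is a natural default: the constraint-aware divergence built from it stays close to TRPO's KL-based trust region.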