Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Embedding Safety into RL: A New Take on Trust Region Methods
Authors: Nikola Milosevic, Johannes Müller, Nico Scherf
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that C-TRPO reduces constraint violations while maintaining competitive returns. ... We benchmark C-TRPO against 9 common safe policy optimization algorithms ... on 8 tasks (4 Navigation and 4 Locomotion) from the Safety Gymnasium ... benchmark. |
| Researcher Affiliation | Academia | 1Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig 2Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Dresden/Leipzig 3Institut für Mathematik, Technische Universität Berlin, 10623 Berlin, Germany. Correspondence to: Nikola Milosevic <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Constrained TRPO (C-TRPO); differences from TRPO in blue |
| Open Source Code | Yes | 1Code: https://github.com/milosen/ctrpo |
| Open Datasets | Yes | We benchmark C-TRPO against 9 common safe policy optimization algorithms ... on 8 tasks (4 Navigation and 4 Locomotion) from the Safety Gymnasium (Ji et al., 2023) benchmark. |
| Dataset Splits | No | The paper states, "Each algorithm is evaluated by training for 10 million environment steps with 5 seeds each," which describes the evaluation protocol. However, it does not provide training/test/validation splits for any underlying dataset. The Safety Gymnasium benchmark supplies the environments, but the paper does not describe any data split within them that would need to be reproduced. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions or other libraries). |
| Experiment Setup | Yes | Each algorithm is evaluated by training for 10 million environment steps with 5 seeds each, and the cost regret is monitored throughout training for every run. ... For the C-TRPO implementation we fix the convex generator φ(x) = x log(x), motivated by its superior performance in our experiments, see Appendix B.2.1, and b̂ = 0.8b and β = 1 across all experiments. |
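To make the quoted setup concrete: φ(x) = x log(x) is a convex generator in the standard φ-divergence sense, D_φ(p‖q) = Σᵢ qᵢ φ(pᵢ/qᵢ), and for this particular φ the induced divergence is the KL divergence KL(p‖q). The sketch below (plain Python, function names hypothetical; this illustrates the generic φ-divergence construction, not the paper's C-TRPO implementation) checks that identity on a small discrete distribution.

```python
import math

def phi(x):
    # Convex generator phi(x) = x * log(x), with phi(0) = 0 by continuity.
    return 0.0 if x == 0 else x * math.log(x)

def phi_divergence(p, q):
    """Discrete phi-divergence D_phi(p || q) = sum_i q_i * phi(p_i / q_i).

    For phi(x) = x log(x) this reduces to the KL divergence KL(p || q).
    Assumes q_i > 0 wherever p_i > 0 (absolute continuity).
    """
    return sum(qi * phi(pi / qi) for pi, qi in zip(p, q) if qi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
print(phi_divergence(p, q), kl)  # the two values agree
```

A different choice of generator (e.g. φ(x) = (x − 1)², giving the χ²-divergence) plugs into the same `phi_divergence` skeleton, which is why the paper can compare generators as in its Appendix B.2.1.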