Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
A Provable Approach for End-to-End Safe Reinforcement Learning
Authors: Akifumi Wachi, Kohei Miyaguchi, Takumi Tanabe, Rei Sato, Youhei Akimoto
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 7 (Experiments): We conduct empirical experiments for evaluating our PLS in multiple continuous robot locomotion tasks designed for safe RL. We adopt Bullet-Safety-Gym [19] and Safety-Gymnasium [26] benchmarks and implement our PLS and baseline algorithms using OSRL and DSRL libraries [36]. |
| Researcher Affiliation | Collaboration | Authors: Akifumi Wachi, Kohei Miyaguchi, Takumi Tanabe, Rei Sato, Youhei Akimoto. Affiliations: LY Corporation, University of Tsukuba, RIKEN AIP. |
| Pseudocode | Yes | Appendix C (Pseudo Code of PLS): For completeness, we will present a pseudo code of our PLS. Algorithm 1: Provably Lifetime Safe Reinforcement Learning (PLS) |
| Open Source Code | Yes | Also, we submit the source code as supplementary material. |
| Open Datasets | Yes | We adopt Bullet-Safety-Gym [19] and Safety-Gymnasium [26] benchmarks and implement our PLS and baseline algorithms using OSRL and DSRL libraries [36]. |
| Dataset Splits | No | Dataset. We assume access to an offline dataset $D := \{\bar{I}^{(i)}\}_{i=1}^{n}$, where $n \in \mathbb{Z}_+$ is a positive integer. Let $\beta : X \to \Delta(A)$ denote a behavior policy. The dataset $D$ comprises $n$ independent episodes generated by $\beta$; that is, $D \sim (P^{\beta})^{n}$. Assessment: the paper describes an offline dataset for training but does not provide specific splits for training, validation, or testing. |
| Hardware Specification | Yes | Our experiments were conducted in a workstation with Intel(R) Xeon(R) Silver 4316 CPUs@2.30GHz and 1 NVIDIA A100-SXM4-80GB GPUs. |
| Software Dependencies | No | We adopt Bullet-Safety-Gym [19] and Safety-Gymnasium [26] benchmarks and implement our PLS and baseline algorithms using OSRL and DSRL libraries [36]. The paper mentions the use of OSRL and DSRL libraries but does not provide specific version numbers for these or other key software components. |
| Experiment Setup | Yes | K.2 Hyperparameters: We use the OSRL library for implementing most of the baseline algorithms. We leverage the default hyperparameters used in the OSRL library for the baselines. For CCAC, we use the authors' implementation. For baselines, we use Gaussian policies with mean vectors given as the outputs of neural networks, and with variances that are separate learnable parameters. The policy networks and Q networks for all experiments have two hidden layers with ReLU activation functions. K_P, K_I, and K_D are the PID parameters [47] that control the Lagrangian multiplier for the Lagrangian-based algorithms (i.e., BCQ-Lag and BEAR-Lag). We use the same 10^5 gradient steps and the same rollout length, which is the maximum episode length, for CDT and baselines for fair comparison. Specifically, we set the rollout length to 500 for Ant-Circle, 200 for Ant-Run, 300 for Car-Circle and Drone-Circle, 200 for Drone-Run, and 1000 for Velocity. The safe cost thresholds for baselines are 20 and 40 across all the tasks. The hyperparameters used in the experiments are shown in Table 3. |
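
The values quoted in the Experiment Setup row can be gathered into a small configuration sketch for anyone attempting a re-run. This is only an illustration: the names `ROLLOUT_LENGTH`, `COST_THRESHOLDS`, `RunConfig`, and `make_run_configs` are hypothetical and do not come from the paper or the OSRL library, the task keys are the labels used in the quoted text rather than registered environment IDs, and the gradient-step count assumes that the extracted "105" stands for 10^5. The remaining hyperparameters (Table 3 of the paper) are not reproduced here.

```python
from dataclasses import dataclass

# Rollout length (maximum episode length) per task, as quoted in the
# Experiment Setup row. Keys are the labels from the text, not env IDs.
ROLLOUT_LENGTH = {
    "Ant-Circle": 500,
    "Ant-Run": 200,
    "Car-Circle": 300,
    "Drone-Circle": 300,
    "Drone-Run": 200,
    "Velocity": 1000,
}

# Safe cost thresholds used for the baselines across all tasks.
COST_THRESHOLDS = (20, 40)

# Assumption: the extracted "105 gradient steps" is read as 10^5.
GRADIENT_STEPS = 10**5


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for one (task, threshold) run."""
    task: str
    cost_threshold: int
    rollout_length: int
    gradient_steps: int = GRADIENT_STEPS


def make_run_configs():
    """Enumerate every task and cost-threshold combination."""
    return [
        RunConfig(task, threshold, length)
        for task, length in ROLLOUT_LENGTH.items()
        for threshold in COST_THRESHOLDS
    ]


if __name__ == "__main__":
    for cfg in make_run_configs():
        print(cfg)
```

Enumerating the grid this way makes it explicit that the quoted setup implies twelve task and threshold combinations, each sharing the same gradient-step budget.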