Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Provably Safe Reinforcement Learning with Step-wise Violation Constraints
Authors: Nuoya Xiong, Yihan Du, Longbo Huang
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate the superiority of our algorithms in safety performance and corroborate our theoretical results. |
| Researcher Affiliation | Academia | (1) Institute for Interdisciplinary Information Sciences, Tsinghua University; (2) University of Illinois at Urbana-Champaign |
| Pseudocode | Yes | Algorithm 1: SUCBVI; Algorithm 2: SRF-UCRL |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the described methodology. |
| Open Datasets | No | The paper describes using 'a custom MDP environment' and 'a grid world environment' for experiments, but provides no links, DOIs, or formal citations to make these datasets/environments publicly accessible. |
| Dataset Splits | No | The paper does not specify exact percentages or absolute sample counts for training, validation, or test splits. It mentions total steps/episodes but no data partitioning. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python version, library versions). |
| Experiment Setup | Yes | For our Safe-RL-SW experiment, we set the number of total steps T = 500000. In each episode, the agent gets reward 10 if it arrives at the goal state at the end, and 0 otherwise. For all states except the goal state, c(s) = 0. For the goal state, c(goal) = 0.5. We consider two unsafe states with c(unsafe1) = 0.6 and c(unsafe2) = 0.7. The safety threshold τ = 0.5. For SUCBVI, we set the confidence level δ = 0.05. For UCBVI, we set its parameter to be 0.05. For Opt CMDP-bonus, we set its parameter to be 0.05. For Triple-Q, we set its parameter to be 0.05. For Optpess, we set its parameter to be 0.05. We also tune the learning rate for these algorithms. For Safe-RFE-SW, we set the total number of episodes K = 50000, and ε = 0.1, δ = 0.05. We use a grid world environment with 25 states, 4 actions, and 10 horizon. There is one goal state with c(goal) = 0.5, one unsafe state with c(unsafe) = 0.6. The safety threshold τ = 0.5. |
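
The Safe-RL-SW cost settings quoted in the Experiment Setup row can be written down as a small configuration object, which makes it easier to see which states exceed the safety threshold. The following is a minimal sketch, not the authors' code: the class name, method names, and state labels (`"goal"`, `"unsafe1"`, `"unsafe2"`, `"s0"`, `"s3"`) are hypothetical, and the violation measure used here (the sum of `max(c(s) - τ, 0)` over visited states) is an assumption about how a step-wise violation might be computed, since the summary above does not define it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StepwiseSafetySetup:
    """Values quoted in the Experiment Setup row for Safe-RL-SW."""

    tau: float = 0.5             # safety threshold
    total_steps: int = 500_000   # total number of steps T
    delta: float = 0.05          # confidence level used for SUCBVI
    goal_reward: float = 10.0    # reward for ending the episode at the goal
    # State labels are hypothetical; only the cost values come from the summary.
    costs: dict = field(default_factory=lambda: {
        "goal": 0.5,
        "unsafe1": 0.6,
        "unsafe2": 0.7,
    })

    def cost(self, state):
        """Return c(s); every state not listed above is treated as cost 0."""
        return self.costs.get(state, 0.0)

    def is_unsafe(self, state):
        """A state counts as unsafe here when its cost strictly exceeds tau."""
        return self.cost(state) > self.tau

    def violation(self, trajectory):
        """ASSUMPTION: accumulate max(c(s) - tau, 0) over the visited states."""
        return sum(max(self.cost(s) - self.tau, 0.0) for s in trajectory)


if __name__ == "__main__":
    setup = StepwiseSafetySetup()
    trajectory = ["s0", "unsafe1", "s3", "unsafe2", "goal"]  # made-up path
    print([s for s in trajectory if setup.is_unsafe(s)])
    print(round(setup.violation(trajectory), 3))
```

Under these settings the goal state sits exactly at the threshold (c(goal) = τ = 0.5), so with a strict inequality it is not flagged as unsafe, while both listed unsafe states are.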